ROBBINS-MONRO PROCEDURES FOR
TAILORED TESTING


FREDERIC M. LORD 
Educational Testing Service 


WHEN computers are used in the schools for instructional pur- 
poses, it is a matter of convenience to use them for measurement 
purposes also (Turnbull, 1968). The problem of securing accurate 
measurements is different from the problem of giving effective in- 
struction. This paper is concerned entirely with measurement and 
not at all with instruction. 

In tailored testing, an attempt is made to tailor the difficulty of 
the test items administered to the “ability” of the individual being 
tested. For most purposes, it will be convenient here to think of the 
problem of testing or “measuring” just one single individual. 

A large pool of items must be available at the start of the testing. 
The statistical characteristics of these items must be known from 
earlier testings. 

If an examinee answers all n items in a test correctly, we are not 
able to pinpoint his ability level; for example, we can not tell how 
he compares with some other examinee who also answers all items 
correctly. A similar conclusion applies if the examinee does not 
know the answer to any of the test items. Other things being equal, 
the best measurement is obtained when the examinee knows the 
answer to roughly half of the items administered. In tailored test-
ing we try to choose items for administration that are at a difficulty
level that matches the examinee’s ability, which we infer from his
responses to the items already administered. A convenient, if over- 
simplified, rule for doing this is that when the examinee gives a 
wrong answer to an item, the next item administered should be an 
easier one; when he gives a correct answer, the next item admin- 
istered should be harder. More complicated rules can be investi- 
gated (for example, see Wetherill, 1963; Wetherill and Levitt, 1965), 
but this will not be done here. 

Certain questions still remain to be answered before we can actu- 
ally start testing: 


1. What should be the difficulty level of the first item? 

2. How much should the difficulty level be changed after any 
given right or wrong answer? 

3. How should the examinee’s responses be scored? 


4. How should the effectiveness of various possible procedures 
be compared? 


We are not presently able, either theoretically or practically, to 
provide fully optimal answers to most of these questions, except 
for tests too short to be of much practical interest. So little is known 
about the answers to these questions, however, that we can learn 
much simply by trying out various plausible procedures and ex- 
amining the kind of results obtained. This is the approach adopted 
here. Since convincing field studies (see Linn, Rock, and Cleary, 
1969) of all the various plausible procedures would be impossibly 
extensive and expensive at this point, the procedures are here ex- 
amined by test theory methods rather than by actual experimental 
test administrations. 

In order to make any progress, we must somehow be able to predict
how an examinee will perform on a new item, even if this item
is at a different difficulty level from any to which the examinee has
previously responded. To do this, we need to make use of item
characteristic curves (see Birnbaum, 1968). For simplicity, it is
assumed here that the available items differ from each other in
difficulty but not on other statistical parameters.


Item Characteristic Curves 


The characteristic curve of an item represents the probability of
success on the item as a function of the ability level, θ, of the
examinee. Here, we will consider only the case where the item
characteristic curves are all of the form

$$P \equiv P(\theta, b) \equiv c + (1 - c)\,\Phi[a(\theta - b)], \qquad (-\infty < \theta < \infty), \tag{1}$$

where the symbol ≡ is used to indicate a definition; a, b, c are
parameters describing the item; and Φ represents the normal ogive
function

$$\Phi(t) \equiv \int_{-\infty}^{t} \phi(y)\,dy, \tag{2}$$

where φ is the normal curve ordinate

$$\phi(y) \equiv \frac{1}{\sqrt{2\pi}}\,e^{-y^{2}/2}. \tag{3}$$


Several such curves are shown in Figure 1. 
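
As a purely illustrative aid, the three-parameter normal ogive curve described above can be evaluated numerically. The sketch below is not part of the original study; the function names `normal_ogive` and `icc` are hypothetical, and the normal ogive is obtained from the standard-library error function.

```python
import math


def normal_ogive(t: float) -> float:
    """Standard normal distribution function Phi(t), via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def icc(theta: float, b: float, a: float = 1.0, c: float = 0.0) -> float:
    """Item characteristic curve: P(theta, b) = c + (1 - c) * Phi[a * (theta - b)].

    theta: examinee ability; b: item difficulty;
    a: item discriminating power; c: guessing parameter.
    """
    return c + (1.0 - c) * normal_ogive(a * (theta - b))


if __name__ == "__main__":
    # Tabulate one curve with guessing (c = .20) on an item of difficulty b = 0.
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(icc(theta, b=0.0, a=0.5, c=0.2), 3))
```

When c = 0 the curve passes through .5 at θ = b; with guessing it passes through ½(1 + c) there instead.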


When the item cannot be answered correctly by random guessing, 
с = 0, and (1) becomes the usual normal ogive curve; the parameter с 


Figure 1. Normal ogive item characteristic curves. 
culty score for large n is approximately normally distributed with
variance of order 1/n and expectation θ* = θ + constant. In the
present problem,

$$\theta^{*} \equiv \theta - \frac{1}{a}\,\Phi^{-1}\!\left(\frac{\gamma - c}{1 - c}\right). \tag{9}$$


If γ = ½(1 + c), then θ* = θ, the final difficulty score is a con-
sistent estimator for θ, and its approximate large-sample variance
is minimized by choosing


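$$d_1 = \frac{\sqrt{2\pi}}{a(1 - c)}, \tag{10}$$

in which case this variance is

$$\operatorname{Var}\, b_{n+1} = \frac{\pi(1 + c)}{2na^{2}(1 - c)}. \tag{11}$$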


In contrast to the above, the results to be presented in this paper
are small-sample results. The foregoing was presented here as back-
ground for the decision to limit investigation of shrinking-step-size
procedures mainly to Robbins-Monro processes satisfying (8),
with b_{n+1} recorded as the examinee's score.


Evaluation 


In many problems of statistical inference, we require a consistent
estimator, preferably one with the smallest possible sampling
variance. Although estimating the examinee's ability level is definitely
a statistical inference problem, neither of these requirements is
appropriate here.

Equation (9) tells us that if γ ≠ ½(1 + c), then b_{n+1} is not a con-
sistent estimator of θ. Instead, it may be a strongly biased estimator,
even when n becomes indefinitely large. However, the bias of b_{n+1}
is the same for all examinees tested. A constant bias does not
affect comparisons of examinees, and this is usually all that
is important in mental testing. Thus, there is usually no need to
seek either a consistent or an unbiased estimator of θ. (As a
consequence, the present investigation need not be restricted to any
particular value of γ.)


Actually, in most measurement situations the scale chosen for
measuring ability is quite arbitrary. If θ in (1) is replaced by some
monotonic transformation θ* = θ*(θ), we have a new three-parameter
family of item characteristic curves with ability now measured along
a θ* scale. In general, there is no good reason to assert that the θ
scale provides a “truer” measure of ability than does the θ* scale.
Because of this fact, comparisons between methods for estimating
examinee ability must be invariant under any monotonic transformation
of the scale used to measure ability.

Here we will describe the effectiveness of any score x for measuring
the ability θ by means of the information function of the scoring formula

$$I_x(\theta) \equiv \frac{\left[\dfrac{\partial}{\partial\theta}\,\mathcal{E}(x \mid \theta)\right]^{2}}{\operatorname{Var}(x \mid \theta)} \tag{12}$$

(Birnbaum, 1968, p. 453). The numerator here is the squared slope
of the regression of test score on θ; the denominator is the condi-
tional variance of the test score for fixed θ.

The efficiency of scoring method x_2 relative to scoring method
x_1 is given by the ratio

$$\mathrm{RE} \equiv \frac{I_{x_2}(\theta)}{I_{x_1}(\theta)}. \tag{13}$$

It is easily verified from (12) that the relative efficiency of two
methods remains the same whether ability is measured by θ or by
any monotonic transformation θ*(θ). This is the required invari-
ance property.
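
The verification can be sketched in a few lines. The derivation below is an illustration rather than part of the original text; it assumes the transformation is strictly monotonic and differentiable, and the symbol g is introduced here only for that purpose.

```latex
% Assumption: \theta^{*} = g(\theta), with g strictly monotonic and
% differentiable, so that g'(\theta) \neq 0.
\begin{align*}
\frac{\partial}{\partial\theta^{*}}\,\mathcal{E}(x \mid \theta^{*})
  &= \frac{1}{g'(\theta)}\,\frac{\partial}{\partial\theta}\,\mathcal{E}(x \mid \theta),
  \qquad
  \operatorname{Var}(x \mid \theta^{*}) = \operatorname{Var}(x \mid \theta), \\[4pt]
I_x(\theta^{*}) &= \frac{I_x(\theta)}{[g'(\theta)]^{2}}, \\[4pt]
\frac{I_{x_2}(\theta^{*})}{I_{x_1}(\theta^{*})}
  &= \frac{I_{x_2}(\theta)\,/\,[g'(\theta)]^{2}}{I_{x_1}(\theta)\,/\,[g'(\theta)]^{2}}
   = \frac{I_{x_2}(\theta)}{I_{x_1}(\theta)} .
\end{align*}
```

The information function itself changes with the scale, but the factor [g'(θ)]² is common to both scoring methods and cancels in the ratio.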

Since we will not be assuming large n, we will not consider here
the asymptotic properties of I_x(θ). The meaning and justification
of I_x(θ) as used here will be described by paraphrasing Mandel
and Stiehler (1954):


If it is desired to differentiate between two nearby values, θ′ and
θ″, by means of the corresponding measurements x′ and x″, it is
apparent that the success of the operation will depend on two
circumstances: (1) the magnitude of the difference ε″ − ε′ ≡
E(x″ | θ″) − E(x′ | θ′) for a given difference θ″ − θ′, i.e., the mag-
nitude of the slope (ε″ − ε′)/(θ″ − θ′); and (2) the precision of
measurement Var(x | θ). These two desiderata can be com-
bined in a single criterion, I_x(θ), defined as the ratio of the
squared slope to Var(x | θ).


It is helpful to visualize the situation with the aid of a small diagram:

[Diagram not reproduced; surviving labels: θ′, θ″.]

A more formal discussion of the small-sample interpretation is given
by Lord (1952, pp. 21-25). (The term "information function" may
be misleading if I_x(θ) is used without its asymptotic properties. It
does not seem wise to try to rename I_x(θ) here, however.)

In order to have some idea of the meaning of any particular value
of I_x(θ), it is helpful to compare it with values of I_x(θ) characteriz-
ing some “standard” test. The standard tests used here for such com-
parisons are n-item conventional tests, composed of statistically
equivalent items, administered in the usual way, the examinee’s
score being the number of items he answers correctly. Tailored
tests on which there is no guessing are compared to a standard test
on which c = 0; tailored tests with c = .20 are compared to a
standard test with c = .20.


It may help to note that for a group in which θ is normally dis-
tributed with zero mean and unit variance, the 60-item standard
test with a = .50, b = 0, and c = 0 will have a parallel-forms reli-
ability of about .90. For tests like the standard tests, I_x(θ) is found
from (12) to equal

$$I_x(\theta) = \frac{nP'^{2}}{PQ}, \tag{14}$$

where P is given by (1), P′ is its derivative with respect to θ, and
Q ≡ 1 − P. The reader may wish to look at Figures 2 and 3: in
each, the tallest curve shows I_x(θ)/a² for an appropriate standard
test.

For the standard tests, I_x(θ) and I(θ) are proportional to test
length, n. It is very helpful for understanding the meaning of I_x(θ)
to think of a k-fold increase in I_x(θ), for given θ, as equivalent to
the increase in information gained by a k-fold lengthening of a con-
ventional test.
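
The information function of a standard test is simple enough to compute directly. The following is a minimal sketch, assuming the number-right score on n statistically equivalent items, so that the regression of score on θ is nP and its conditional variance is nPQ; the function name `standard_test_information` is hypothetical.

```python
import math


def phi(t: float) -> float:
    """Normal curve ordinate."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t: float) -> float:
    """Normal ogive (standard normal distribution function)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def standard_test_information(theta: float, n: int, a: float,
                              b: float = 0.0, c: float = 0.0) -> float:
    """Information of the number-right score on n equivalent items: n * P'^2 / (P * Q)."""
    z = a * (theta - b)
    p = c + (1.0 - c) * Phi(z)      # item characteristic curve
    dp = (1.0 - c) * a * phi(z)     # derivative of P with respect to theta
    return n * dp * dp / (p * (1.0 - p))


if __name__ == "__main__":
    # The 60-item standard test with a = .50, b = 0, c = 0, at several abilities.
    for theta in (-3, -2, -1, 0, 1, 2, 3):
        print(theta, round(standard_test_information(theta, n=60, a=0.5), 3))
```

Because the result is proportional to n, dividing the information of any other procedure at a given θ by the per-item value gives the length of the standard test that would measure equally well at that ability level.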


Computing the Information Function 


Let us assume that b_1, the difficulty of the first item administered,
is the same for every examinee. When not otherwise stated, we will
set b_1 = 0. Under Robbins-Monro procedures, n subsequent item
difficulty parameters are determined from the examinee's responses
in accordance with (7). The constants γ and d_1, d_2, ..., d_n in (7) are
selected in advance of the testing.

There are 2^n different possible patterns of item response u ≡
{u_1, u_2, ..., u_n}. The probability of any particular pattern is
simply

$$\operatorname{Prob}(u \mid \theta) = \prod_{i=1}^{n} P_i^{\,u_i}\, Q_i^{\,1-u_i}, \tag{15}$$

where P_i is the probability of success on the ith item administered,
as determined from b_i and θ by equation (1). Since each pattern
u uniquely determines a final difficulty score x = b_{n+1}, (15) readily
provides the conditional frequency distribution of final difficulty
score for given θ, denoted by f(x | θ).

If n is 10 or 12, all 2^n values of (15) can be calculated by a com-
puter for any chosen value of θ. This gives f(x | θ), from which
E(x | θ) and Var(x | θ) are readily computed. This is repeated for
various values of θ. The numerator of (12) can be computed by a
recursion relationship, which need not be written out here.
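
The brute-force calculation can be sketched as follows. This is an illustration under stated assumptions, not the program used in the study: equations (7) and (8) are taken to be b_{i+1} = b_i + d_i(u_i − γ) with d_i = d_1/i, a form inferred from the worked arithmetic later in the paper; and the slope in the numerator of the information function is approximated by a central finite difference rather than by the recursion relationship mentioned above. All function names are hypothetical.

```python
import itertools
import math


def Phi(t: float) -> float:
    """Normal ogive (standard normal distribution function)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def icc(theta: float, b: float, a: float, c: float) -> float:
    """Item characteristic curve: c + (1 - c) * Phi[a * (theta - b)]."""
    return c + (1.0 - c) * Phi(a * (theta - b))


def final_score_moments(theta, n, a, c, gamma, d1, b1=0.0):
    """Mean and variance of the final difficulty score b_{n+1} for fixed theta,
    obtained by enumerating all 2**n response patterns."""
    mean = 0.0
    second = 0.0
    for pattern in itertools.product((0, 1), repeat=n):
        b = b1
        prob = 1.0
        for i, u in enumerate(pattern, start=1):
            p = icc(theta, b, a, c)
            prob *= p if u else (1.0 - p)
            # Assumed stepping rule: b_{i+1} = b_i + (d1 / i) * (u_i - gamma).
            b += (d1 / i) * (u - gamma)
        mean += prob * b
        second += prob * b * b
    return mean, second - mean * mean


def information(theta, h=1e-4, **kw):
    """Squared slope of the regression of score on theta, divided by Var(score | theta).
    The slope is a central finite difference (an assumption of this sketch)."""
    m_plus, _ = final_score_moments(theta + h, **kw)
    m_minus, _ = final_score_moments(theta - h, **kw)
    _, var = final_score_moments(theta, **kw)
    slope = (m_plus - m_minus) / (2.0 * h)
    return slope * slope / var


if __name__ == "__main__":
    settings = dict(n=8, a=1.0, c=0.0, gamma=0.5, d1=3.0)
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(theta, round(information(theta, **settings), 4))
```

The running time doubles with every added item, which is why the enumeration is practical only for short tests.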

When more than 12 or so items are administered, this brute-force
method cannot be used. The results reported here were obtained
by minor modifications of an ingenious method of Cochran and
Davis (1965, p. 32), involving n successive Lagrangian interpola-
tions. The numerator of (12) was computed for desired values of
θ by evaluating the derivative of the appropriate Lagrangian inter-
polating polynomial.
Computations were done by a computer program devised by
Martha Stocking. Final results were checked against brute-force
results for small n and, in certain cases, against earlier results (Lord,
1971) for large n. Accuracy of repetitive interpolation, also checked
by reruns with different interval widths, was found to be excellent.


Choice of Initial Step Size When There 
Is No Random Guessing 


In tailored testing, item difficulty is adjusted by successive steps
in an attempt to match the item difficulty to the ability level of the
examinee. If the steps are too small, it may take too long to reach a
difficulty appropriate for a high-ability or low-ability examinee. If
the steps are too large, an appropriate difficulty level cannot be
maintained, even if once achieved. The big
theoretical advantage of a shrinking-step-size procedure is that it
uses a large step size at first, when the items may be poorly matched
to the examinee’s ability, and progressively smaller step sizes as the
match is improved, avoiding both the above-mentioned problems.

In (10), an asymptotically “optimum” value of d_1 = √(2π)/a was
mentioned for c = 0. However, this value was chosen to minimize
the asymptotic sampling variance of the final difficulty score, without
reference to the information function I_x(θ) that is of concern here.

Figure 2 shows a plot of the information function for each of five
initial step sizes, determined by d_1 = 1/a, 2/a, √(2π)/a, 3/a, and
5/a. The stepping procedure is Robbins-Monro as defined by (7)
and (8). The score is the final difficulty score x = b_{n+1}. All five
have n = 60, c = 0, and γ = .5.

The figure is labeled so that it can be used for any choice of
initial item difficulty b_1 and for any value of the item discrim-
inating power a. [...] For
example, if we are chiefly interested in examinees in the range
−2 < θ < 3, then we need look at I_x(θ)/a² in Figure 2 only for
those portions of the curves in the range a(−2 − b_1) < a(θ − b_1)


Figure 2. Index of discrimination for five Robbins-Monro tailored testing
procedures with n = 60, c = 0, and γ = .5. (Horizontal axis: a(θ − b_1).)
< a(3 − b_1). If a = [...] and b_1 = 0, we will want only the range −2 [...]


Some of the conclusions suggested by Figure 2 are:

1. A poor choice of initial step size can seriously reduce the
value of the test results.

2. An initial step size determined by d_1 = 2.5/a or 3/a seems
nearly optimal.

3. Even for tests with unusually high a, the choice of initial step
size, surprisingly enough, will not usually depend appreciably
on the range of talent to be tested.

4. The best of the tailored procedures shown are almost as good
as the standard test for examinees at θ = b_1. Specifically, the
best 60-item tailored procedure shown is about as good at
θ = b_1 as a 57-item standard test.

5. The tailored procedures provide good measurement for a much
wider range of examinee ability than does the standard test.

6. The procedures shown provide excellent measurement at ex-
treme levels of aθ, where measurement is not required. This
suggests that some other tailored procedure will be found that
will have better measurement properties near θ = b_1 without
serious loss within the range of θ that is usually of practical
interest.

The reader should remember not only that the foregoing con-
clusions are tentative, but that they apply only to the special cir-
cumstances studied:

1. The item characteristic curves are given by (1).

2. The test items differ in difficulty, but not on other parameters.

3. The sequence of item difficulties is determined by (7) and (8).

5. c = 0.
Choice of Offset and of Initial Item Difficulty


The parameter γ in (7) will be called the offset. It determines
the relative size of upward and downward steps. When items can-
not be answered correctly by random guessing, optimal measure-
ment for an examinee with ability θ requires a test composed en-
tirely of items with b = θ. Thus when there is no guessing, sym-
metry considerations require in tailored testing that upward and
downward steps be of equal size for fixed i; that is, that γ = ½.

When random guessing occurs, c ≠ 0, and I_x(θ) no longer has
the symmetry around θ = b_1 that is apparent in Figure 2. Figure 3
illustrates the effects of different choices of offset when c = .20.

Some of the conclusions suggested by these results are:


1. Random guessing may substantially reduce the effectiveness
of a test as a measuring instrument.

2. An offset of γ = .5 is unsatisfactory when c = .20.

3. When γ = 2/3, the information function is nearly flat over a
wide range: good measurement is obtained over the range
−1 ≤ a(θ − b_1) ≤ 4.

4. When γ is reduced from 2/3 to .6, more information can be
obtained, although for a narrower range of examinees.

5. Where there is random guessing, the difficulty of the first
item should ordinarily not be set at b_1 = 0, that is, at the mean
value of θ for the group tested; instead it should be easier than
this.

6. The effectiveness of the test may depend markedly on the
choice of difficulty level for the first item.

These last two points are important and require further explanation.
If there is no random guessing, and if we are equally interested
in good measurement on both sides of the mean (i.e., above and
below θ = 0), then, because of the symmetry in Figure 2, we will
ordinarily choose a first item with b_1 = 0. When there is random
guessing, however, the function I_x(θ) will not be symmetric with
respect to θ = 0. In this case, we will usually choose b_1 so as to ob-
tain maximum information over some range of ability. The choice
of b_1 ≠ 0 does not change the I_x(θ) curve; it only changes the inter-
pretation of the base line on which it is plotted.

For example, Figure 3 shows that when γ = 2/3 we will have
good discrimination over the range −1 ≤ a(θ − b_1) ≤ 4. If
a = 1, the choice of b_1 = −1 makes this range −1 ≤ θ + 1 ≤ 4
or −2 ≤ θ ≤ 3.

Figure 3 shows that still more information than this can be
obtained by setting γ = .6. In this case, best results are obtained
for an interval such as 1 ≤ a(θ − b_1) ≤ 4. If a = ½, the choice
of b_1 = −5 will give good measurement in the range 1 ≤ ½θ +
5/2 ≤ 4 or −3 ≤ θ ≤ 3.

At first sight, this last result seems so strange as to require some
rationalization. It shows that if a = .5, best results are obtained by
setting γ = .6 and making the first item so easy (b_1 = −5) that
virtually everyone will answer it correctly! The following table
shows P(θ, −5), the probability of answering this item correctly,
for different ability levels:


        θ =    −3     −2     −1      0
P(θ, −5) =    .87    .95    .98   .995


The following rationalizations may help to explain the foregoing 
result. The first “step” taken, under (7) and (8), is twice as long as 
any of the others; therefore, it is very important that it should not 
be taken in the wrong direction. If an examinee does not know the 
answer to the first item, he may get it right by random guessing; 
if this happens, the first step will be a large one taken in the wrong 
direction and the second item administered will be still harder than 
the first. In order to avoid this, the first item is made so easy (b_1 =
−5) that very few examinees will resort to guessing.

The effect is almost as if the first item had been omitted from
the test altogether, and all examinees had started with b_1 = −5
+ 6.2(1 − .6) = −2.52, subsequent items being chosen according
to (7) with d_i = 6.2/(i + 1), i = 1, 2, ..., n. This last sequence
of d_i converges only about half as fast as (8).
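
The arithmetic of this example can be checked with a few lines of code. The sketch below is illustrative only; it assumes that (7) and (8) have the form b_{i+1} = b_i + d_i(u_i − γ) with d_i = d_1/i, which is the form implied by the calculation −5 + 6.2(1 − .6) = −2.52 above. The names `icc` and `rm_difficulties` are hypothetical.

```python
import math


def icc(theta: float, b: float, a: float, c: float) -> float:
    """Item characteristic curve: c + (1 - c) * Phi[a * (theta - b)]."""
    return c + (1.0 - c) * 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def rm_difficulties(responses, b1: float, d1: float, gamma: float):
    """Difficulties b_1, ..., b_{n+1} for a response pattern (1 = right, 0 = wrong),
    under the assumed rule b_{i+1} = b_i + (d1 / i) * (u_i - gamma)."""
    b = [b1]
    for i, u in enumerate(responses, start=1):
        b.append(b[-1] + (d1 / i) * (u - gamma))
    return b


if __name__ == "__main__":
    a, c, gamma = 0.5, 0.2, 0.6
    d1 = 3.1 / a          # initial step size d_1 = 3.1/a = 6.2
    b1 = -5.0             # very easy first item

    # Probability of a right answer to the first item at several ability levels.
    for theta in (-3, -2, -1, 0):
        print(theta, round(icc(theta, b1, a, c), 3))

    # Difficulty of the second item after a right answer to the first.
    print(rm_difficulties([1], b1, d1, gamma)[1])

    # Step-size constants used after the first item: d1/2, d1/3, ...
    print([round(d1 / i, 2) for i in (2, 3, 4)])
```

After a right answer to the first item, the remaining steps use d_1/2, d_1/3, and so on, which is the sequence 6.2/(i + 1) quoted in the text.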


Choice of Initial Step Size
When There Is Random Guessing

In Figure 3, all the tailored tests had an initial step size determined
by d_1 = 3.1/a. Here let us consider whether some other d_1 would
be preferable (we will not attempt to locate the optimal value with
any great accuracy, however).

Figure 4 compares five different initial step sizes for 60-item tai-
lored tests having c = .20 and γ = 2/3. This figure suggests the


following conclusions: 


1. If the length of the first step is badly chosen, measurement
may be seriously impaired.

2. Good measurement is obtained if d_1 is in the range 3 ≤ ad_1
≤ 5.

3. The larger the initial step size, the larger the range of effec-
tive measurement.

4. The larger the initial step size, the easier the first item should
be.

5. If a = .5, a choice of d_1 = 8.0 and of b_1 = −3 gives good
measurement when −1 ≤ .5(θ + 3) ≤ 4, that is, for −5
≤ θ ≤ 5.

6. The foregoing 60-item tailored test is about as effective for
all examinees in the range −3 ≤ θ ≤ 3 as a 55-item standard
test is for those examinees it measures best.


When γ = .6, the curves for the information function (not shown
here) tend to be less flat and skewed to the left. The curve for
ad_1 = 3.1 and γ = .6 is barely higher in the range 2.5 < θ < 3.5
than the curve for ad_1 = 4.0 and γ = 2/3, but it is not as high
elsewhere.


Fixed vs. Shrinking Step Size


Lord (1971) investigated fixed-step-size procedures, item difficulty
being determined by (6). When there is no guessing, the “best” 60-
item fixed-step-size procedure found uses a step size of d = .20/a.
The information curve for this procedure is shown in Figure 5, labeled
ad = .20. Other curves are shown, for a fixed step size of ad = .05
and for ad = .50.

One of the best curves from Figure 2 is copied in Figure 5 for
comparison. This is the curve for the Robbins-Monro procedure with
ad_1 = √(2π), so labeled here. It seems clear that in this case, at least,
the Robbins-Monro procedure is better than any of the fixed-step-
size procedures studied.

When step size is fixed, use of the final difficulty score b_{n+1} is often
unsatisfactory: such a score can assume only a particular set of more
or less widely spaced values and thus cannot approximate θ as closely
as might be desired. It is usually better to score the test by adding
up (or averaging) b_i values. The result is called the average
difficulty score, defined precisely by

$$x \equiv \frac{1}{n}\sum_{i=2}^{n+1} b_i. \tag{16}$$

Note that the first item is omitted from the summation (since it is
the same for everyone), but that the hypothetical (n + 1)st item is
included. [A slightly superior method of scoring found by Wether-
ill, Chen, and Vasudeva (1966) has not been investigated for tai-
lored testing. Its use should not materially change the conclusions
reached here.]
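
The two scores can be contrasted with a small sketch. It is illustrative only: the function names are hypothetical, and the list of difficulties in the usage example is made up for demonstration rather than taken from the study.

```python
def final_difficulty_score(b):
    """Final difficulty score: the difficulty b_{n+1} of the hypothetical next item.

    b is the list [b_1, b_2, ..., b_{n+1}] of item difficulties.
    """
    return b[-1]


def average_difficulty_score(b):
    """Average difficulty score: the mean of b_2, ..., b_{n+1}.

    The first item is omitted (it is the same for everyone) and the
    hypothetical (n + 1)st item is included, so n values are averaged.
    """
    n = len(b) - 1
    return sum(b[1:]) / n


if __name__ == "__main__":
    # Made-up difficulties for a 4-item test with a fixed step of 0.5.
    b = [0.0, 0.5, 1.0, 0.5, 1.0]
    print(final_difficulty_score(b))    # last difficulty reached
    print(average_difficulty_score(b))  # mean of the last four values
```

With a fixed step the final score is confined to a coarse grid of values, whereas the average can take many intermediate values.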

For ad = .05 and for ad = .50, Figure 5 shows information curves 
both for average difficulty score (marked A) and for final difficulty 
score (marked F). Clearly, when n = 60, the average difficulty 
score is superior when step size is fixed. 

The conclusions already drawn from Figure 5 may be typical
of results when c = 0 and n is large. A look at Figure 6, which dis-
plays results for some 10-item tests, suggests caution against over-
generalization, however.

When n = 10, a step size determined by ad_1 = 3 was found to be
“best” among the Robbins-Monro procedures tried. The informa-
tion curve for this procedure is shown in Figure 6, labeled ad_1 = 3.
The other curves shown represent fixed-step-size procedures cor-
responding to those displayed in Figure 5 for n = 60. For θ near
zero, some of the fixed-step-size procedures are here seen to be
better than the best of the Robbins-Monro procedures.

It would be good to have some results evaluating the use of the av-
erage difficulty score in conjunction with Robbins-Monro stepping
procedures, and also of various weighted average difficulty scores,
such as, for example,

$$x \equiv \frac{2}{n(n + 3)}\sum_{i=2}^{n+1} i\,b_i.$$

Unfortunately, for n > 10 there do not seem to be any practical
methods for computing the expectations and sampling variances
necessary for evaluating such scoring methods.


Item Economy 


In planning tailored testing, one must have items available for
each of the possible “paths” that the examinee may follow as a re-
sult of the stepping procedure. In the case of Robbins-Monro
procedures with step size governed by (8), one needs to have two
items available for administration following the first item,
four items available for administration following the second
item, and so forth. A total of 2^n − 1 items are needed, in
theory, before testing can be begun. Even for n = 20, this would
be more than a million items.

Figure 6. Comparison of fixed-step-size procedures with one of the best
Robbins-Monro procedures. n = 10, c = 0, γ = .50 (A = average difficulty
score, F = final difficulty score).

For such shrinking-step-size procedures, the number of items 
needed is represented by a geometric series in powers of 2 with n 
terms. For fixed-step-size procedures, however, only an arithmetic 
series is involved, so that only n(n + 1)/2 items are required. For 
n = 20, 210 items are needed in theory when step size is fixed; for
n = 60, 1830 items. This number can be greatly reduced by certain
obvious shortcuts and approximations.
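
The two item counts are easy to tabulate. The sketch below simply evaluates the geometric and arithmetic series described above; the function names are hypothetical.

```python
def items_needed_shrinking(n: int) -> int:
    """Items needed in theory for an n-item shrinking-step-size test:
    1 + 2 + 4 + ... + 2**(n - 1) = 2**n - 1."""
    return 2 ** n - 1


def items_needed_fixed(n: int) -> int:
    """Items needed in theory for an n-item fixed-step-size test:
    1 + 2 + ... + n = n * (n + 1) / 2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (6, 10, 20, 60):
        print(n, items_needed_shrinking(n), items_needed_fixed(n))
```

The shrinking-step count grows exponentially because no two response patterns lead to the same difficulty, whereas with a fixed step many paths revisit the same difficulty levels.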

These figures suggest that for n > 6, say, a strict use of Robbins-
Monro methods is impractical because of the number of items re-
quired. An obvious suggestion is to approximate the steps required
by (8), while using only steps that are a multiple of some prechosen
minimum step size, denoted here by Δ. If Δ is large enough, the
number of items needed is much reduced.

For the case where there is no guessing, Figure 7 compares in-
formation curves obtained for two different values of Δ with those
obtained for other procedures already studied. The top curve is the
same Robbins-Monro process used as a standard of comparison in
Figure 5. The curve just below it is the same process, modified with
Δ = .05/a. The bottom curve, labeled ad_1 = 2.6, aΔ = .20, is also
the same process, modified with Δ = .20/a. The values of ad_i are
compared in the following table for these three curves:


i =            1      2      3    ...    10   ...    30   ...    60
eq. (8)       2.51   1.25   0.84  ...   0.25  ...   0.08  ...   0.04
Δ = .05/a     2.5    1.25   0.85  ...   0.25  ...   0.10  ...   0.05
Δ = .20/a     2.6    1.2    0.80  ...   0.20  ...   0.20  ...   0.20
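
The modified step sizes can be reproduced approximately with a simple rounding rule. The rule itself is an assumption of this sketch, since the text does not spell it out: each exact value ad_1/i is rounded to the nearest multiple of the minimum step size (called `a_delta` below), and no step is allowed to fall below one such multiple. The function name `approximate_steps` is hypothetical.

```python
import math


def approximate_steps(a_d1: float, a_delta: float, n: int):
    """Step sizes a*d_i for i = 1..n, each rounded to a multiple of a_delta.

    Assumed rule: round a_d1 / i to the nearest multiple of a_delta,
    with a floor of one multiple so that the step never shrinks to zero.
    """
    steps = []
    for i in range(1, n + 1):
        exact = a_d1 / i
        multiples = max(1, round(exact / a_delta))
        steps.append(multiples * a_delta)
    return steps


if __name__ == "__main__":
    a_d1 = math.sqrt(2.0 * math.pi)   # about 2.51
    shown = (1, 2, 3, 10, 30, 60)
    for a_delta in (0.05, 0.20):
        steps = approximate_steps(a_d1, a_delta, 60)
        print(a_delta, [round(steps[i - 1], 2) for i in shown])
```

Under this assumed rule the larger minimum step freezes the step size from about the tenth item onward, so that the procedure then behaves like a fixed-step-size procedure.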

For comparison, Figure 7 repeats some of the curves from Figure
5, representing fixed-step-size procedures. The figure suggests the
following tentative conclusions.

1. When Δ is not much larger than d_1/n, modification of the
Robbins-Monro procedure does not cost much in terms of
measurement efficiency. However, this is the case where the
modification gains little in item economy.

2. When Δ is enough larger than d_1/n to economize effectively
on items, the suggested modification of the Robbins-Monro
process causes an unacceptable penalty on the efficiency of
measurement.
for the examinee. Why, then, does this procedure show up so
poorly? The answer may be that the poor results are due to the use
of final difficulty score. If average difficulty score were used with
this stepping procedure, better measurement might result. Unfor-
tunately, this possibility cannot be checked here, because of the
difficulty in computing information curves for this scoring pro-
cedure when the d_i are unequal.


Hybrid Procedures 


Several writers have suggested use of fixed-step-size procedures 
preceded by one or more large initial steps. This seems a promising 
approach, for reasons mentioned in the last section. А computer 
program was written to investigate the following hybrid procedure: 


1. First take n_1 steps of decreasing size, as determined by (7)
and (8);

2. then take n − n_1 steps of fixed size, d, determined by (6).

3. Compute an average difficulty score from the last n − n_1
items only.


For the case where there is no guessing, Figure 8 shows the effective- 
ness of measurement obtained by three such hybrid procedures, 
together with results for two fixed-step-size procedures, for com- 
parison. 

In general, the effect of the hybridization here seems to be to
improve measurement at extreme values of θ at the expense of
measurement around θ = 0. If a = .5 and b_1 = 0, the simple fixed-
step-size procedure with ad = .20 is better than any of the hybrid
procedures investigated within the range −3 < θ < 3.

The curve for the hybrid procedure with ad_1 = 1, n_1 = 1, and
ad = .40 is not shown in the figure because it almost entirely coin-
cides with the fixed-step-size curve for ad = .40. Thus for this
fixed step size, such hybridization is neither helpful nor harmful.
The results are still inferior for most purposes, however, to those
obtained with a smaller fixed step size.

Figure 9 shows results obtained for various hybrid methods when
random guessing occurs. Here, there is little to choose between cer-
tain of the hybrid procedures and the comparable pure fixed-step-
size procedure. Hybridization seems to offer no distinct advantages,
however.

Many other hybrid procedures would be of interest (see Wetherill,
1963). No others are considered here.


Summary and Conclusions 


An earlier study (Lord, 1971) investigated various fixed-step-
size methods for tailored testing. It seems plausible that shrinking-
step-size methods might be preferable. These allow rapid matching
of item difficulty to the ability level of the examinee initially, when
his level is very poorly known; but close matching later in the
testing, when his level can be inferred with some accuracy from his
responses to items already administered.

Robbins-Monro procedures are shrinking-step-size procedures
in which the final score of the examinee (called the final difficulty
score) is the difficulty level of the item that would be administered
next if the testing were continued. Other scoring methods would
no doubt be preferable for certain of these procedures, but com-
putational difficulties have prevented any extensive investigation
of them here.

The Robbins-Monro procedures studied here have a harmonic
sequence of step lengths, or at least an approximation to this. There
is no firm basis for this choice; other sequences also deserve in-
vestigation.

Tailored testing methods involve many different parameters: 


1. examinee ability level (θ)

2. test length (n)

3. item difficulty (b_i, i = 1, ..., n)

4. item discriminating power (a_i, i = 1, ..., n)

5. item guessing parameters (c_i, i = 1, ..., n)

6. offset (γ)

7. parameters controlling step size (d_i, i = 1, ..., n).


In addition, there is a virtually unlimited choice of scoring methods.
With so many variables to control, it is difficult to reach any really
firm conclusions about the general merits of various procedures on
the basis of a few illustrative specimens of each. The following list
includes tentative conclusions at various levels of generality. The
reader is cautioned that the conclusions are based on very limited
investigations; many of them cannot be expected to hold for other
circumstances.

Figure 8. Comparison of certain hybrid procedures with the corresponding
fixed-step-size procedures, n = 60, c = 0, γ = .50.


1. When there is no random guessing, the best 60-item Robbins-
Monro procedure studied is about as good at the optimal
difficulty level for effective measurement as a conventional
(peaked) test of about 57 items.

2. Tailored procedures provide good measurement over a much
wider range of examinee ability than do typical conventional
tests.

3. The Robbins-Monro procedures studied tend to provide
good measurement at ability levels not encountered in prac-
tice. This is paid for by only slightly impaired measurement
at more usual ability levels.

4. The length of the first step in a Robbins-Monro procedure is
unimportant within a considerable range of tolerance, but is
quite important beyond this range.

The following tentative conclusions relate to 60-item Robbins-
Monro procedures when there is roughly a 20 per cent chance of
answering any item correctly by random guessing:


5. Symmetrical treatment of right answers and wrong answers
is not adequate. Upward steps must be shorter than downward
steps in order to compensate for chance success.

6. The effectiveness of the procedure may depend markedly on
the difficulty level of the first item administered. For example,
it may be important that this item be answered correctly
by virtually everyone.

7. The best 60-item Robbins-Monro procedure studied is about 
as good at the optimum difficulty level for effective measure- 
ment as a conventional (peaked) test of about 55 items. 


The following tentative conclusions compare the 60-item Robbins-Monro
procedures studied with other 60-item procedures:


8. Fixed-step-size procedures provide almost as good measurement
as the Robbins-Monro procedures studied and do not
require nearly so many items.

9. Strictly speaking, any Robbins-Monro procedure requires
the availability of $2^n$ items. This is likely to be uneconomical
for n > 6, say.

10. One shortcut, used to reduce the number of items needed to
reasonable limits, destroyed the measurement effectiveness
of the Robbins-Monro procedure. It might be possible to
regain most of this loss by changing the scoring method.
Computational complications prevented further investigation
of this possibility.
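
The growth of the item pool behind conclusions 8 to 10 can be illustrated with a small count. This is only a sketch under assumptions not taken from the paper: one item is needed for every distinct difficulty level reachable at each stage, the i-th step of the shrinking procedure has length 1/i, and the fixed-step procedure always moves by 1. The function name `items_required` is hypothetical.

```python
from fractions import Fraction


def items_required(n, shrinking):
    """Count the items needed so that every response pattern finds an item.

    Assumption for illustration: one item is needed per distinct difficulty
    level reachable at each of the n stages. With shrinking=True the i-th
    step has length 1/i (a harmonic, Robbins-Monro-style sequence); with
    shrinking=False every step has length 1.
    """
    levels = {Fraction(0)}          # difficulty of the first item
    total = 0
    for stage in range(1, n + 1):
        total += len(levels)        # items needed at this stage
        step = Fraction(1, stage) if shrinking else Fraction(1)
        # a right answer moves up by one step, a wrong answer moves down
        levels = {b + step for b in levels} | {b - step for b in levels}
    return total


for n in (3, 6, 10):
    print(n, items_required(n, shrinking=False), items_required(n, shrinking=True))
```

Under these assumptions the fixed-step count grows like n(n + 1)/2, while the shrinking-step count is bounded by 2^n - 1 and stays close to it for small n, which is the order of magnitude that makes the pool uneconomical beyond six or seven items.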


Certain hybrid procedures were investigated: a Robbins-Monro
process with large step sizes was used for $n_1$ initial steps, after which
a fixed-step-size procedure was used with smaller steps.


11. Hybridization improved measurement at extreme ability levels,
but often at the cost of impaired measurement for levels
usually encountered. For typical items, the hybrid procedures 
studied showed no advantage over fixed-step-size procedures 
for virtually all examinees in a typical group. 


To summarize, shrinking-step-size procedures have certain ob- 
vious advantages over fixed-step-size procedures. However, if more 
than six or seven items are to be administered to an examinee, the
item pool required by the shrinking-step-size procedures is so large
as to be prohibitive. Certain obvious shortcuts are possible for
reducing the item pool, but so far these do not seem to lead to as
effective measurement as do the simple fixed-step-size procedures. 
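
A minimal sketch of the two kinds of up-and-down rule being compared may help fix ideas. It assumes, purely for illustration, that the shrinking procedure uses step lengths `first_step / i` and that a right answer moves the difficulty up while a wrong answer moves it down; `administer` and `up_ratio` are hypothetical names, and `up_ratio` below 1 gives the shorter upward steps that conclusion 5 calls for when guessing is possible.

```python
def administer(responses, first_difficulty=0.0, first_step=1.0,
               shrinking=True, up_ratio=1.0):
    """Trace the item difficulties chosen for a given string of responses.

    responses is a sequence of booleans (True = right answer).
    shrinking=True uses step lengths first_step / i (an assumed
    Robbins-Monro-style sequence); shrinking=False keeps the step fixed.
    up_ratio scales upward steps relative to downward steps.
    """
    b = first_difficulty
    path = [b]
    for i, correct in enumerate(responses, start=1):
        step = first_step / i if shrinking else first_step
        # harder item after a right answer, easier item after a wrong one
        b = b + up_ratio * step if correct else b - step
        path.append(b)
    return path


pattern = [True, True, False, True]
print(administer(pattern, shrinking=True))
print(administer(pattern, shrinking=False))
```

With a fixed step the examinee keeps revisiting the same few difficulty levels, so items at a level can be shared across response patterns; with shrinking steps nearly every pattern ends at its own level, which is what drives the pool size.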

Only a few of the possible procedures and circumstances have 


It should be noted that no method of administering items and
scoring dichotomous item responses can produce better measurement
than is achieved at θ = 0 by one of the "standard tests." The
tailored test tries to approach this level of measurement for all
examinees, not just those at θ = 0. Can such tests be much further
improved, within the context of dichotomous item response?

RELAXED RANK ORDER TYPAL ANALYSIS¹


LOUIS L. McQUITTY 
University of Miami 
Coral Gables, Florida 


RANK Order Typal Analysis uses a strict definition for classifying
objects into types; few objects satisfy its requirements. The
method is relatively inappropriate for fallible data (McQuitty,
1963).

This paper relaxes the definition of types in relation to the re- 
quirements of data and renders Rank Order Typal Analysis more 
widely applicable for classifying objects into hierarchical systems 
based on assessment of their characteristics. 

If the initial classification criterion is relatively restrictive, the 
method will usually initiate a classification but the classification 
may not proceed far. At this point, the criterion is liberalized 
minimally in order for the classification to proceed. 

On the other hand, if the initial classification criterion is liberal 
for the data, the method will either classify every object into one 
of only two categories or even fail to render a division. In either 
of these two eventualities, the criterion can be made more restric- 
tive. The classification criterion is adjusted to the requirements of 
the data as the analysis proceeds.


The Theory 

In Rank Order Typal Analysis, a type is defined as a category of
objects (specified in terms of selected characteristics) of such a
nature that every object in the category is more like every other
object in the category than it is like any object in any other category.

¹ This investigation was supported by Public Health Service Research
Grant No. [...], National Institute of Mental Health.

Furthermore, if only pure categories are sought, a category of n
objects does not qualify as a type unless it includes qualifying
subcategories of 2, 3, 4, ..., n - 1 objects, where n equals the number
of objects in the category under consideration (McQuitty, 1964).

By way of contrast, Elementary Linkage Analysis (McQuitty,
1957) requires only that every object is classified with the one object
most like itself. Between these two extremes are several degrees of
freedom which can be used to adjust the method to the requirements
of the data.


The Method 


Rank Order Typal Analysis requires that the indices of a matrix
of interassociations be converted to ranks within columns. 

Whenever an object is introduced into a category it brings with 
it two groups of ranks, viz., those that it has in its column, showing
the order in which other objects rank with it, and those that it has 
in its row, showing the ranks that it has with other objects. 

In Relaxed Rank Order Typal Analysis, a type is a category of
objects of such a nature that an object has no rank in either its row 
or column above a specified maximum; the maximum is the minimal 
value which renders an internally consistent category. 


Illustration. 


The method is illustrated with the data of Table 1, which reports
agreement scores between pictures of spoons as judged by one
subject in terms of specified characteristics. These data were chosen
because they had proven particularly difficult to analyze with the
method of Hierarchical Classification by Reciprocal Pairs; an improved
method had to be developed in order to classify the spoons
(McQuitty, Price, and Clark, 1967).

The entries in every column were ranked from 1 to n - 1,
where n equals the number of objects. Table 2 reports these ranks
within columns. A slight deviation was introduced into the usual
method of ranking. Suppose, for example, that the highest score is
30, followed by 29, 29, 29, and 28. Thirty is assigned a rank of 1,
and every 29 a rank of 2 (rather than 3 as in the customary method);
28 is assigned a rank of 5 because it has four scores above it. The
smallest numerical rank involved is assigned to all tied scores. 
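
The tie rule just described amounts to giving each score a rank of one plus the number of scores strictly above it. A minimal sketch, with the hypothetical function name `rank_min_ties`, reproduces the worked example from the text:

```python
def rank_min_ties(scores):
    """Rank scores in descending order, giving tied scores the smallest rank.

    A score's rank is one plus the number of scores strictly above it.
    """
    return [1 + sum(other > s for other in scores) for s in scores]


print(rank_min_ties([30, 29, 29, 29, 28]))  # [1, 2, 2, 2, 5]
```

Applied to each column of the agreement matrix in turn (leaving out the diagonal entry), the same function would yield a table of within-column ranks.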

TABLE 1

Reciprocal pairs. A search is made first for the "strict" types,
i.e., those searched for in Rank Order Typal Analysis when only
pure categories are sought.

Let n equal the number of objects in a submatrix being examined
to see whether or not the set of objects qualifies as a type. A type is a
submatrix of objects of such a nature that every object in the
submatrix has no rank greater than n - 1 with any other object of the
submatrix (where the ranks are taken from the original matrix).

In order for a pair of objects to qualify under this definition,
neither object of the pair has a rank above one with the other
object. If the two objects are represented by i and j, i has a rank of one
with j, and j has a rank of one with i, and this outcome constitutes
a reciprocal pair.

Every matrix contains at least one reciprocal pair. Suppose i and
j are the objects between which the highest entry in a matrix
mediates. Object i is then highest with j and j is highest with i; the
two objects constitute a reciprocal pair by definition.

Table 2 contains four reciprocal pairs, viz., Objects 3-6, 3-16,
4-12, and 10-20.
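
The search for reciprocal pairs can be sketched in a few lines. The layout of the rank table (a dictionary keyed by row object and column object), the function name, and the four-object data are all assumptions made for illustration; the data are not the spoon ranks of Table 2.

```python
def reciprocal_pairs(ranks, criterion=1):
    """List pairs (i, j), i < j, whose mutual ranks do not exceed criterion.

    ranks maps (row_object, column_object) to the within-column rank.
    """
    objects = sorted({i for i, _ in ranks})
    return [(i, j)
            for i in objects for j in objects
            if i < j and ranks[(i, j)] <= criterion and ranks[(j, i)] <= criterion]


# Hypothetical 4-object rank matrix (not the spoon data of Table 2).
toy = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3,
    ("B", "A"): 1, ("B", "C"): 3, ("B", "D"): 2,
    ("C", "A"): 2, ("C", "B"): 2, ("C", "D"): 1,
    ("D", "A"): 3, ("D", "B"): 3, ("D", "C"): 1,
}
print(reciprocal_pairs(toy))  # [('A', 'B'), ('C', 'D')]
```

Calling the same function with `criterion=2` gives the relaxed notion of a reciprocal pair, in which each object need only be first or second most like the other.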

Typal triads. In a search for typal triads every object must have
no rank higher than 2 (i.e., n - 1) in either its column or row.
Select any Object i of any reciprocal pair. Let it be Object 3 of
Reciprocal Pair 3-6. If Pair 3-6 expands into a type of three objects, it
must incorporate the one object second most like Object 3. Object
3 has no object second most like it because two objects, 6 and 16,
are tied for being most like it, and both, therefore, have a rank of
one with it. Since Object 6 is a member of the reciprocal pair which
is being tested for expansion into a triad, this leaves Object 16 to be
selected for the test. (If there had been more ranks tied at a value
of one, all of their objects would have been tested one at a time
with the order being immaterial.) Objects 3, 6, and 16 are
assembled in Table 3, using their ranks from Table 2. They do not
constitute a type because there is a rank higher than 2 in the matrix,
viz., the rank in Row 6 and Column 16. Pair 3-6 cannot be expanded
into a type of 3 objects and for this reason it cannot be expanded
into a type of more than 3 objects under the strict definition. In
order for a category of any size to qualify under the strict definition,
all smaller categories must qualify.
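
The strict test can be written as a short recursive check. The sketch below assumes the same dictionary layout for the ranks as a convenience (it is not part of the method), uses hypothetical function names, and reads the requirement on smaller categories as "at least one qualifying subcategory of each smaller size".

```python
from itertools import combinations, permutations


def max_rank(ranks, members):
    """Largest rank that any member has with any other member."""
    return max(ranks[(i, j)] for i, j in permutations(members, 2))


def is_strict_type(ranks, members):
    """Strict definition: no rank above n - 1, and a qualifying subcategory
    of n - 1 objects (which in turn must contain one of n - 2, and so on)."""
    members = tuple(members)
    n = len(members)
    if max_rank(ranks, members) > n - 1:
        return False
    if n == 2:
        return True
    return any(is_strict_type(ranks, sub) for sub in combinations(members, n - 1))


# Hypothetical 4-object rank matrix (not the spoon data of Table 2).
toy = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3,
    ("B", "A"): 1, ("B", "C"): 3, ("B", "D"): 2,
    ("C", "A"): 2, ("C", "B"): 2, ("C", "D"): 1,
    ("D", "A"): 3, ("D", "B"): 3, ("D", "C"): 1,
}
print(is_strict_type(toy, ("A", "B")))        # True
print(is_strict_type(toy, ("A", "B", "C")))   # False
```

In the toy data the triad fails for the same kind of reason as Objects 3, 6, and 16: a single rank of 3 (Row B, Column C) exceeds n - 1 = 2.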

If Pair 3-16 expands into a type of three objects, it must include
Object 6. This gives the same non-qualifying triad reported in
Table 3. Pair 3-16 cannot be expanded under the strict definition
into a larger type.

TABLE 3
Testing Pairs 3-6 and 3-16 for Expansion

a Size of the criterion under which the object qualifies; Objects 3 and 6, for example, qualify
as a pair under a criterion of 1 (n - 1 = 1).
b Objects 3, 6, 16, and 9 form a tetrad under a criterion of [...].
c Assigned to a type of Table 6 under a criterion of 9 or less.

Object 4 of Reciprocal Pair 4-12 has Objects 2, 7, 15, and 20 all 
tied for second highest with it. They are all tested, one at a time, in 
Table 4 for expansion of Type 4-12. They introduce one, two, 
three, and three ranks, respectively, above n — 1 = 2. None of 
them qualifies, and Type 4-12 cannot be expanded under the strict 
definition into a type of 3 or more objects. 


TABLE 4
Testing Pair 4-12 for Expansion

a Size of the criterion under which the object qualifies.


9 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


The only other reciprocal pair, 10-20, is tested in Table 5 for
expansion into a type of 3 or more objects. It fails to qualify because
Object 7, the only object second highest with Object 10, introduces
one rank above n - 1 = 2. Even though Object 18 is second highest
with Object 20 (the other object of Reciprocal Pair 10-20), it
need not be tested; its inclusion as one of the 3 objects would
exclude Object 7 which is the only object second most like Object 10,
and Object 18 would have a rank larger than 2 with Object 10. If,
on the other hand, Object 18, the only object second highest with
Object 20, had been tested first in lieu of Object 7, then Object 7
would not have required a test.

In the classification, thus far, Types 3-6, 3-16, 4-12, and 10-20
have been realized; no other types of the same size can be realized
and no larger types can be realized under the strict definition of a
type.

Relaxing the criterion. The criterion for typal membership is
increased successively from one by steps of one and is applied in
expanding the types until all objects of Table 2 are classified into one
of two major types or have proven that they cannot be classified.
Each successive criterion is applied to every type being built before
the criterion is relaxed further. An object satisfies the criterion if it
has a rank equal to or smaller than the criterion with an object of
the type being tested for expansion. 
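
The relaxation loop can be sketched for a single type as follows. This is a simplified reading, not the author's full procedure: it grows one type at a time, whereas the paper applies each criterion to every type being built before relaxing further. The function names and the four-object rank data are hypothetical.

```python
def brought_in(ranks, members, criterion):
    """Objects outside the type that rank within the criterion in a member's column."""
    objects = {i for i, _ in ranks}
    return sorted(o for o in objects - set(members)
                  if any(ranks[(o, m)] <= criterion for m in members))


def qualifies(ranks, members, candidate, criterion):
    """True if the candidate's ranks with every member, both ways, stay within the criterion."""
    return all(ranks[(candidate, m)] <= criterion and ranks[(m, candidate)] <= criterion
               for m in members)


def expand(ranks, members, max_criterion):
    """Relax the criterion one step at a time, admitting objects as they qualify."""
    members = list(members)
    history = []                       # (criterion, admitted object)
    for criterion in range(1, max_criterion + 1):
        changed = True
        while changed:                 # re-test after each admission
            changed = False
            for cand in brought_in(ranks, members, criterion):
                if qualifies(ranks, members, cand, criterion):
                    members.append(cand)
                    history.append((criterion, cand))
                    changed = True
                    break
    return members, history


# Hypothetical 4-object rank matrix (not the spoon data of Table 2).
toy = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3,
    ("B", "A"): 1, ("B", "C"): 3, ("B", "D"): 2,
    ("C", "A"): 2, ("C", "B"): 2, ("C", "D"): 1,
    ("D", "A"): 3, ("D", "B"): 3, ("D", "C"): 1,
}
print(expand(toy, ["A", "B"], 3))
```

The recorded history plays the role of the criterion values written over the object code numbers in the tables: it shows under which criterion each object entered.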


TABLE 5
Testing Pair 10-20 for Expansion

A comprehensive approach would require a search for new
reciprocal pairs each time the criterion is relaxed. When it is relaxed
from 1 to 2, a reciprocal pair qualifies if i is first or second most like
j and j is first or second most like i. This step was not included in
this pencil and paper example because it is too elaborate; it could be
included in an electronic computer analysis.

Expansion of the criterion could be stated in a fashion relative to 
the size of the type being tested for expansion. Under this approach, 
the criterion would be n - 1 initially and then relaxed successively
to n, n + 1, n + 2, etc. The present study uses the amount of
absolute discrepancy, as outlined above, on the basis of the assump- 
tion that the purpose is to minimize the absolute amount of error in 
classifying every object. 

Under the absolute approach, the criterion for the present data is 
now increased from 2 to 3 and Object 7 qualifies to extend Type 
10-20 into a Triad of 10-20-7, as shown in Table 5. The number 
three is placed over Object 7 in Table 5 to record that it qualifies 
under a criterion of three.

Under the criterion of three, Object 10 brings in Object 15 for a
test; Object 20 brings in Object 18, and Object 7 brings in Objects
2, 12, and 17, as shown in Table 5. None of them qualifies under a
criterion of three.

The criterion of three is applied to Pairs 3-6 and 3-16. Object 3
brings in Object 9 on a test basis; Object 6 brings in Objects 9, 11,
and 19, and Object 16 brings in Objects 1, 9, 13, and 14, as shown in
Table 3.

Object 9 joins Pair 3-6 to form a triad under the criterion of
three; it also joins 3-16 to form another triad under the same
criterion. No other types can be formed with either Pairs 3-6 or 3-16
without a larger criterion. Object 9 brings in no other object under 
this criterion. 

Under the criterion of three, neither object of Pair 4-12, Table 
4, brings in additional objects (over and above Objects 2, 7, 15, and 
20 brought in for tests under a criterion of two), and no 
types are formed from the objects already brought in for tests 
under a criterion of two.

The criterion is now raised to four. Objects 4 and 12 still bring in
no additions. However, Object 2 joins Pair 4-12 to form a triad,
4-12-2, under this criterion. Object 2 brings in Object 17 under the
criterion of four, and no new types are formed.

Under the criterion of four, Object 3 of triads 3-6-9 and 3-16-9, 
Table 3, brings in Object 17. No additional types are formed.

Under the criterion of four, none of the objects of Triad 
10-20-7, Table 5, brings in any additional objects, and no new types 
are formed. 

The criterion is raised to five. Objects 10 and 7 of triad 10-20-7 
bring in no additional objects. Object 20 brings in Object 4. Object 
2 joins the triad to form a tetrad of 10-20-7 and 2. Object 2 brings in 
Objects 4 and 12. No new types are formed.

The criterion is raised to six. Only Object 4 of Triad 4-12-2 
brings in an object, viz., 5, and only Type 4-12-2-7 is formed. Ob- 
ject 7 brings in no other objects under a criterion of six. 

The criterion of six does not change Tables 3 and 5. 

The criterion is raised to seven. Only Object 10 of Tetrad
10-20-7-2, Table 5, brings in an object, viz., 16. Objects 15 and 4
join the type to yield Type 10-20-7-2-15-4. Object 15 brings in
Objects 3, 9, and 13, and Object 4, through its associations in Table
4, brings in all of the objects of Table 4.

Table 4 is, therefore, next examined under the criterion of seven. 
No objects are brought in, and no types are formed. 

Tables 4 and 5 are combined in Table 6, listing the members of 
the two types next to one another. 

Under the criterion of seven, Type 3-6-9-16 incorporates Object
11, and Object 11 brings in Object 5, but no additional types are 
formed. 

The criterion is increased to eight. Object 14 of Table 3 qualifies 
to yield Type 3-6-16-9-11-14. Object 14 brings in Object 15; no 
additional types are formed. 

The criterion of eight brings neither additional objects nor types 
to Type 10-20-7-2-15-4-12 of Table 6.

The criterion is increased to nine. Objects 10 and 20 each bring
in Object 14. Object 2 brings in Object 8. No additional types are
formed.

No additional objects are brought into Table 6 by the criterion 
of nine. Object 13 enters as a typal member in Table 3.

The analysis could proceed under repetition of the above steps until
the analysis is completed. This would be appropriate with
an electronic computer and a computer program. However, in a
pencil and paper analysis, the steps can be shortened.

TABLE 6
Testing Combined Tables 4 and 5 for Further Expansion

a Size of the criterion under which the object qualifies.
b The criterion under which Tables 4 and 5 were combined through Object 4.
c Assigned to a type of Table 3 under a criterion of 9 or less.

All objects have now been assigned to either Table 3 or 6. Ob- 
jects of Table 3 which have already been assigned by а criterion of 
nine or less to types of Table 6 are indicated, and analogously, for 
objects of Table 6 with respect to Table 3.

Table 3 is analyzed observationally to determine the criterion value
at which its unassigned objects (unassigned in either Tables 3 or
6) can enter as typal members. Object 1, for example, enters under
a criterion of ten; the entrance criterion of each unassigned object
is reported over its object code number in Table 3. The same procedure
is applied to Table 6. Objects are assigned to the types with
which they have their lowest criterion number. In the case of any
object not listed in both tables, such as Objects 8 and 18, a check is
made with Table 2 in relation to the criterion of the table in which 
the object is assigned. Object 18 has a criterion assignment of 16 
in Table 6. Table 2 shows that it would have a criterion assignment 
of 19 if it were transferred to Table 3. It remains in Table 6. 
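
The assignment rule, each object going to the type it can enter under the lowest criterion, can be sketched as below. The sketch assumes that an object's entrance criterion is simply its largest rank, in either direction, with any current member of the type; that reading, the function names, and the toy data are assumptions for illustration and simplify the observational analysis described in the text.

```python
def entrance_criterion(ranks, members, obj):
    """Smallest criterion at which obj can enter the type: its largest mutual rank."""
    return max(max(ranks[(obj, m)], ranks[(m, obj)]) for m in members)


def assign(ranks, types, obj):
    """Assign obj to the type it can enter under the lowest criterion.

    types maps a type label to its list of members. Ties are left to the
    caller; min() simply keeps the first label encountered.
    """
    return min(types, key=lambda label: entrance_criterion(ranks, types[label], obj))


# Hypothetical 4-object rank matrix (not the spoon data of Table 2).
toy = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3,
    ("B", "A"): 1, ("B", "C"): 3, ("B", "D"): 2,
    ("C", "A"): 2, ("C", "B"): 2, ("C", "D"): 1,
    ("D", "A"): 3, ("D", "B"): 3, ("D", "C"): 1,
}
types = {"T1": ["A", "B"], "T2": ["D"]}
print(entrance_criterion(toy, types["T1"], "C"))  # 3
print(assign(toy, types, "C"))                    # T2
```

An object such as Object 8, whose entrance criterion is the same in both tables, falls under the tie case that the sketch leaves to the caller.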

Object 8 has a criterion of 19 in Table 6, and it would have a
criterion of the same size in Table 3. Since this is the highest rank
possible, Object 8 is in a sense unassignable. It is, however, assigned
to Table 6 because it has fewer ranks of nineteen there even
when adjusted for the fewer objects in types of that table.


The Hierarchical Classification 


The results of the classification are shown graphically in Figure
1. Categories 3-6-9 versus 3-16-9 are optional, and the conflict is
resolved at the next higher level where all four objects enter a single
category. The chart shows the criterion levels under which objects
classify and, thus, gives an indication of the validity with which they
enter.


Summary 


This paper develops and illustrates a method of classifying falli- 
ble data into larger and larger internally consistent categories. Each 
set of data usually starts with two or more categories which are 
built up gradually and which combine at various levels. The classi- 
fication is realized by relaxing gradually and minimally, step by 
step, the objective criterion of internal consistency. 


Figure 1. Hierarchical classification by Relaxed Rank Order Typal Analysis.
(Vertical axis: criterion values; horizontal axis: object code numbers.)


REFERENCES 


McQuitty, L. L. Elementary linkage analysis for isolating orthogonal
and oblique types and typal relevancies. EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, 1957, 17, 207-229.

McQuitty, L. L. Rank order typal analysis. EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, 1963, 23, 55-61.

McQuitty, L. L. Capabilities and improvements of linkage analysis
as a clustering method. EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, 1964, 24, 441-456.

McQuitty, L. L., Price, L., and Clark, J. A. The problem of ties in a
pattern analytic method. EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, 1967, 27, 787-796.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1971, 31, 45-55. 


THE STABILITY COEFFICIENT 


EDWARD E. CURETON 
University of Tennessee 


In a previous paper (Cureton, 1958) I gave a formula for the 
stability coefficient, and quoted the result in a later paper (Cureton, 
1965). I did not know then that essentially the same formula had 
been given previously by Remmers and Whistler (1938). The for- 
mula is correct, but both my derivation and the one given by 
Remmers and Whistler were slightly defective. A derivation that 
seems to me to be more nearly correct is given below, together 
with some further discussion. 

The inconsistency of a test may be defined as that part of its
error of measurement which is associated with items or forms. One
set of items can never draw forth from an examinee a set of
reactions completely representative of the totality of potential
reactions which represent the ability or trait measured by the
universe of items of which this set (test form) is a random or
stratified-random sample. Hence another set (form), similarly
selected from the same item universe but differing in specific content,
will in general yield a somewhat different score. Most of the
literature on test reliability is concerned exclusively or almost
exclusively with consistency. Two forms of a test are parallel if
their items are parallel random or stratified-random samples from
the same item universe.

The instability of a test is that part of its error of measurement
which is associated with the particular time and occasion on which
it is administered. The ability or trait measured by a test, quite apart
from learning, mental growth, permanent forgetting, or mental
decline, fluctuates with time. These fluctuations include reactions
to session differences in the examiner's procedures, the working
conditions at each particular session, and random and cyclic
variations in emotional control, general fatigue, motivation, attitude
toward the test, anxiety, working procedures including working
speed, resistance to distraction, and access to memory, to name but
a few. The true score of an examinee would then be defined as
his average score both on different forms of the test and on different
examining occasions. As regards these occasions, however, we would
like to assume that over the time span involved there is no mental
[...] last defined as loss of retention. Fluctuations in ability to recall,
given constant retention, are elements of instability.
For a long test, instability might be appreciable over a time
as short as the time required to complete it, so that serial
administration of two forms at the same sitting is not necessarily
equivalent to simultaneous administration (as by, say, the odd and
even items of one double-length form). On the other hand, cyclic
variations in some of the elements of instability come in very long
cycles. Differential seasonal variation (differential over persons)
is at least a tenable hypothesis, and euphoric-depressive cycles
within the normal range may be even longer. It would seem, then,
that no practical interval between the administration of two forms
of a test is long enough to provide assurance that the two sets of
instability errors are uncorrelated.
At least equally important is the fact that true-score change is
a continuous process also. Learning occurs as an examinee proceeds
from the first to the last item of a single test, and the amount
learned is different for different examinees. Between test sessions
each examinee is forgetting whenever he is not learning, and whenever
he is learning in areas which are irrelevant to the area tested.
If, in addition, the interval between test sessions is substantial,
differential mental growth may be appreciable in children, and
differential mental decline in older adults. Hence no interval is short
enough to rule out the possibility that differential true-score change
has occurred. So far as measurement is concerned, fluctuations
in the function are always contaminated by changes in
the function. Instability, as best we can measure it, is
an unknown combination of function fluctuation and function
change.

Theory
Using the linear model of weak true-score theory, we can write,

$$x_1 = k_1(x_a + s_1) + e_1, \qquad x_2 = k_2(x_b + s_2) + e_2. \tag{1}$$

Here $x_1$ and $x_2$ are raw scores on the two forms of the test, $x_a$ and $x_b$
are the corresponding true scores on the occasions when the two forms
are administered, $s_1$ and $s_2$ are the instability errors on the two
occasions, $e_1$ and $e_2$ are the inconsistency errors of the two forms,
and $k_1$ and $k_2$ are constants associated with possible inequalities in
the raw-score units of measurement, and hence in the reliabilities
and variances of the two forms. We can assume without loss of
generality that all scores are taken as deviations from their means
as origins, that $x_a$, $x_b$, $s_1$, and $s_2$ are measured on the same
true-score scale, and that $e_1$ and $e_2$ are measured on the raw-score scales
of $x_1$ and $x_2$ respectively. We assume that $x_a$ and $x_b$ are uncorrelated
with $s_1$ and $s_2$, that all four of these are uncorrelated with $e_1$ and $e_2$,
and that $e_1$ and $e_2$ are uncorrelated with each other. We do not
assume that $x_a$ is uncorrelated with $x_b$, or that $s_1$ is uncorrelated
with $s_2$.
With this model,

$$r_{12} = k_1 k_2(\sigma_{ab} + \sigma_{s_1 s_2})/\sigma_1\sigma_2. \tag{2}$$

Here $\sigma_{ab}$ is the true-score covariance, and $\sigma_{s_1 s_2}$ is the covariance
of the instability errors. By the usual variance ratio definition, the
consistency coefficients are

$$C_1 = k_1^2(\sigma_a^2 + \sigma_{s_1}^2)/\sigma_1^2, \qquad C_2 = k_2^2(\sigma_b^2 + \sigma_{s_2}^2)/\sigma_2^2. \tag{3}$$

We will assume that over the interval between the administration
of the two forms, neither the true-score variance nor the variance
of the instability errors changes appreciably; i.e., that

$$\sigma_a^2 = \sigma_b^2 = \sigma_t^2, \qquad \sigma_{s_1}^2 = \sigma_{s_2}^2 = \sigma_s^2.$$

Then (3) becomes

$$C_1 = k_1^2(\sigma_t^2 + \sigma_s^2)/\sigma_1^2, \qquad C_2 = k_2^2(\sigma_t^2 + \sigma_s^2)/\sigma_2^2, \tag{4}$$

and

$$\sqrt{C_1 C_2} = k_1 k_2(\sigma_t^2 + \sigma_s^2)/\sigma_1\sigma_2. \tag{5}$$

Then dividing (2) by (5),

$$r_s = \frac{r_{12}}{\sqrt{C_1 C_2}} = \frac{\sigma_{ab} + \sigma_{s_1 s_2}}{\sigma_t^2 + \sigma_s^2}. \tag{6}$$

From the second expression of (6) it is clear that $r_s$ is of the
form of a correlation corrected for attenuation: it is the interform
correlation corrected for the attenuation due to the inconsistencies
of the two forms. It seems reasonable to term $r_s$ the stability
coefficient of the test over the interval between the administration
of the two forms.

For computation, $r_{12}$ is the product-moment correlation between
the two forms, and $C_1$ and $C_2$ may be computed by the Kuder-
Richardson Formula 20 or by the split-half method and the
Spearman-Brown formula.
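
As a minimal sketch of this computation: the interform correlation is divided by the geometric mean of the two consistency coefficients, each of which may come from a split-half correlation stepped up by the Spearman-Brown formula. The function names and the numerical values below are hypothetical and serve only to show the arithmetic.

```python
import math


def spearman_brown(r_half):
    """Step a split-half correlation up to full length."""
    return 2 * r_half / (1 + r_half)


def stability_coefficient(r12, c1, c2):
    """Interform correlation corrected for the inconsistency of both forms."""
    return r12 / math.sqrt(c1 * c2)


# Hypothetical values, chosen only to show the arithmetic.
c1 = spearman_brown(0.60)   # 0.75
c2 = spearman_brown(0.50)   # about 0.667
print(round(stability_coefficient(0.63, c1, c2), 3))  # 0.891
```

Because sampling error affects numerator and denominator separately, a computed value can exceed 1 even though the population coefficient cannot.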

If there were no changes in true scores over the interval between
the administration of forms 1 and 2 of the test, we could say

$$r_{ab} = 1, \qquad \sigma_{ab} = \sigma_t^2,$$

and (6) would become

$$r_s = \frac{r_{12}}{\sqrt{C_1 C_2}} = \frac{\sigma_t^2 + \sigma_{s_1 s_2}}{\sigma_t^2 + \sigma_s^2}. \tag{7}$$

But since the second expression of (7) is the same as that of (6),
it is clear that we have no way to determine empirically whether $\sigma_{ab}$
is appreciably lower than $\sigma_t^2$ or not. As the length of the interval
increases, $\sigma_{s_1 s_2}$ will decline from $\sigma_s^2$, its value for simultaneous
administration, to zero when the instability errors on the second
occasion are random with respect to their values on the first occasion;
thereafter it will probably fluctuate between zero and some low
positive value associated [...]

[...] administrations should be noted explicitly, and ideally the time of
day and day of the week of each administration, and any intervening
events (e.g., examinations, snowstorms, epidemics, athletic
events, and the like) which might affect differential instability
to a greater than average degree. If the time interval is long, any
known relevant learning experiences, affecting some examinees but
not others, which occur between the first administration and the
second should be noted also. The stability coefficient is always
attenuated in greater or lesser degree by true-score changes, and
is a function of the specific interval between the first administration
and the second, not merely of the length of this interval.

Since the consistency coefficients enter into the formula for the 
stability coefficient, both forms of the test should be administered 
ideally without time limits, and in practice with time limits sufficient
to permit at least 90 per cent of the examinees to attempt
every item. In the latter case every examinee's answers should be 
augmented by a random answer for every item omitted, before 
scoring. 
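
The augmentation step can be sketched as follows, assuming (for illustration only) that answers are stored as option indices with `None` marking an omitted item; `fill_omits` is a hypothetical helper, not a procedure given in the paper.

```python
import random


def fill_omits(answers, n_choices, rng=None):
    """Replace every omitted answer (None) by a randomly chosen option.

    answers is a list of option indices 0 .. n_choices - 1, with None for
    an omitted item. Marked answers are returned unchanged.
    """
    rng = rng or random.Random()
    return [rng.randrange(n_choices) if a is None else a for a in answers]


filled = fill_omits([2, None, 0, None, 3], n_choices=4, rng=random.Random(0))
print(filled)
```

Passing a seeded `random.Random` makes the augmentation reproducible, which matters if the scoring is to be rerun.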

Note that in (6) and (7) the values of $k_1$, $k_2$, $\sigma_1$, and $\sigma_2$ do not
enter; they cancel when we divide (2) by (5). From this it follows
that while forms 1 and 2 must be parallel forms in the sense that
they both measure the same function or combination of functions,
they need not be equivalent: they do not have to be equally
reliable or equally variable, or to measure in comparable raw-score
units. One form may be a long form and the other a short form,
so long as the two forms consist entirely of different items measuring
the same function.

When the split-half method is used to compute the consistency
coefficients, $r_s$ may be computed by formulas alternative to (6).
The model is

            Half-Test
            I        II
Form A      $x_1$    $x_2$
Form B      $x_3$    $x_4$

Then

$$r_s = \frac{r_{13} + r_{14} + r_{23} + r_{24}}{4\sqrt{r_{12} r_{34}}}, \tag{8}$$

$$r_s = \frac{\sqrt[4]{r_{13} r_{14} r_{23} r_{24}}}{\sqrt{r_{12} r_{34}}}. \tag{9}$$

In general, (9) is the preferred formula, since it does not require
that either $x_1$ and $x_2$ or $x_3$ and $x_4$ be equally reliable or equally
variable. But if these conditions do hold, (8) appears to be at least as
good as (9) and perhaps slightly better. If (6) is used, with $C_1$ and
$C_2$ computed by the split-half method and the Spearman-Brown
formula, the two split-halves of each test form must be equally
reliable and equally variable, and (6) is algebraically equivalent
to (8).
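
The two split-half formulas can be compared numerically. The sketch below rests on an assumed reading of them: (8) divides the arithmetic mean of the four inter-form half-test correlations by the geometric mean of the two within-form correlations, and (9) uses the geometric mean in the numerator as well. The function names and all numerical values are hypothetical.

```python
import math


def stability_eq8(r12, r34, r13, r14, r23, r24):
    """Arithmetic-mean form: mean inter-form r over sqrt(r12 * r34)."""
    return (r13 + r14 + r23 + r24) / (4 * math.sqrt(r12 * r34))


def stability_eq9(r12, r34, r13, r14, r23, r24):
    """Geometric-mean form: fourth root of the inter-form product over sqrt(r12 * r34)."""
    return (r13 * r14 * r23 * r24) ** 0.25 / math.sqrt(r12 * r34)


# Hypothetical half-test correlations.
rs = dict(r12=0.52, r34=0.50, r13=0.50, r14=0.51, r23=0.49, r24=0.50)
print(round(stability_eq8(**rs), 4), round(stability_eq9(**rs), 4))
```

When the four inter-form correlations are equal the two forms coincide; otherwise the geometric mean can never exceed the arithmetic mean, so the second form gives the smaller value.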


If we start with (1), we can reasonably set $k_1 = k_2$, and hence
omit the k's from these equations: the same form should measure
in the same units (and hence be equally reliable and equally
variable) on two different occasions. We then have

$$x_1 = x_a + s_1 + e_1, \qquad x_2 = x_b + s_2 + e_2, \tag{10}$$

$$r_{12} = (\sigma_{ab} + \sigma_{s_1 s_2} + \sigma_{e_1 e_2})/\sigma_1\sigma_2. \tag{11}$$

The argument based on identical items would seem to imply that
$r_{e_1 e_2} = 1$, and hence that $\sigma_{e_1 e_2} = \sigma_e^2$. Then assuming again that
$\sigma_a^2 = \sigma_b^2 = \sigma_t^2$ and $\sigma_{s_1}^2 = \sigma_{s_2}^2 = \sigma_s^2$, the variances are

$$\sigma_1^2 = \sigma_2^2 = \sigma_t^2 + \sigma_s^2 + \sigma_e^2,$$

and

$$r_{12} = \frac{\sigma_{ab} + \sigma_{s_1 s_2} + \sigma_e^2}{\sigma_t^2 + \sigma_s^2 + \sigma_e^2}. \tag{12}$$

It is hard to see how (12) can be interpreted as a stability coefficient:
$\sigma_e^2$, the inconsistency-error variance, appears in both the numerator
and the denominator.

with (0), it is evident, [...]

of particular item responses given at the first session will be only
a part of these effects. But even apart from these considerations, we
cannot assume that $e_1 = e_2$ for every examinee. For many items, the
probability that an examinee will mark the right answer is neither 0
nor 1. Then quite apart from any over-all differences in test-taking
ability represented by $s_1$ and $s_2$, an examinee may mark different
answers to the same item on two occasions even when the probabilities
remain constant. Hence, on both counts, (12) must be replaced by

$$r_{12} = \frac{\sigma_{ab} + \sigma_{s_1 s_2} + \sigma_{e_1 e_2}}{\sigma_t^2 + \sigma_s^2 + \sigma_e^2}, \tag{13}$$

and (13) may give either an inflated or an attenuated estimate of
the value given by (6), depending upon the unknown magnitudes of
the perseveration and item-probability effects.

If we divide (11) by (4), with the k's omitted from the latter,
we obtain

$$\frac{r_{12}}{C_1} = \frac{\sigma_{ab} + \sigma_{s_1 s_2} + \sigma_{e_1 e_2}}{\sigma_t^2 + \sigma_s^2}. \tag{14}$$

While $\sigma_{e_1 e_2} < \sigma_e^2$, it is almost certain that it will remain positive, in
which case (14) will always give an inflated estimate of the stability
coefficient. So far as I am aware, no one has proposed (14) as a
formula for the stability coefficient, and I certainly do not propose
it here.

So far as I can tell, the test-retest coefficient has no clear interpretation
under weak true-score theory. It depends upon a combination
of instability errors, inconsistency errors, perseveration effects, and
item-probability effects, whose relative contributions to its magnitude
cannot be untangled. The only exception would seem to be a pure
speed test (such as, e.g., a number checking test), with the second
administration coming some weeks or months after the first. In such
a case, it would perhaps be not unreasonable to interpret the test-retest
coefficient as an inter-form reliability coefficient.

Data 

We consider first the proposition that instability may be considerable
over a period as short as that required by a group of students
to take one test.

A final examination of 60 four-choice items was given to a large
class in elementary psychology. To minimize copying, students
sitting in alternate seats took a "green" form (Form G) and a
"white" form (Form W). The two forms consisted of the same
questions arranged in different orders. In both forms the items
were arranged essentially randomly with respect to difficulty,
discrimination, and topic. The analyses of the two forms may be
considered replications, with the same items but with different
subjects and different split-halves.
For each group, four scores were obtained, each on 15 items:

$x_1$: odds of first 30
$x_2$: evens of first 30
$x_3$: odds of second 30
$x_4$: evens of second 30

Thus $r_{12}$ and $r_{34}$ will be consistency coefficients, while $r_{13}$, $r_{14}$, $r_{23}$,
and $r_{24}$ will be inter-form correlations. The stability coefficients
were computed by (9). The results are


                                White form    Green form

N                                  726           734
C = \sqrt{r_{12} r_{34}}          .5178         .4825
r (inter-form)                    .5040         .4918
r_s = r/C                         .973         1.019


The proposition is not proved by these data, even with samples of 
over 700. One possible reason is that without stratification by topic, 
the subforms were somewhat lacking in parallelism. 


grade students. Consistency 
odd and even items of each 


With the one exception of Test 2 the stability coefficients are
clustered in the region between .89 and .96, and the difference
between 1.067 and the next highest (.110) is almost twice as great
as the distance between the next highest and the lowest (.065).
If we regard the 1.067 as anomalous, the mean of the other seven
is .931.

Test    r_11    r_22          r_12    \sqrt{r_11 r_22}    r_s

1       .689    .602          .582    .6440               .904
2       .557    .644          .639    .5989              1.067
3       .644    .718          .621    .6776               .916
4       .632    [illegible]   .644    .6732               .957
5       .600    .671          .594    .6655               .892
6       .656    .670          .633    .6630               .955
7       .728    .676          .665    .7015               .948
8       .754    .677          .677    .7145               .948
Mean    .665    .671          .632    .666                .948
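
The arithmetic behind each row of the table can be sketched as follows. This is a minimal illustration rather than the author's own computation; the function name `stability_coefficient` is hypothetical, and it assumes, as the column headings suggest, that the stability coefficient is the inter-form correlation divided by the geometric mean of the two consistency coefficients.

```python
import math


def stability_coefficient(r_interform, r_consistency_1, r_consistency_2):
    """Inter-form correlation divided by the geometric mean of the
    two consistency coefficients (assumed reading of the table)."""
    c = math.sqrt(r_consistency_1 * r_consistency_2)
    return r_interform / c


# Test 1 of the table: consistency coefficients .689 and .602,
# inter-form correlation .582.
r_s_test1 = stability_coefficient(0.582, 0.689, 0.602)
print(round(r_s_test1, 3))
```

A value above 1.0, as for Test 2, simply means that the inter-form correlation exceeded the geometric mean of the two consistency coefficients in that sample.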

While tests of the hypothesis that a single correlation corrected
for attenuation is not significantly different from unity exist
(Forsyth and Feldt, 1969; Lord, 1957; McNemar, 1958), there is
no test for the hypothesis that the mean of seven or eight such 
coefficients, all correlated, does not differ significantly from unity. 
In view of the sample size, however (988), it seems safe to conclude 
that some instability exists over the one to three day period. 

In a previously reported study (Cureton, 1939, 1965), three
27-item forms of the A.C.E. Opposites Test were mimeographed
on one sheet and administered consecutively to 187 undergraduate
students in five classes on a Friday. Three other forms, also
mimeographed on one sheet, were given to some of the same classes
on the following Monday, to some on the following Wednesday,
and to some on the following Friday. All were given without time
limit. The average of the six within-sheet correlations was .729;
the average of the nine across-sheet correlations was .500.¹ If we
regard the within-sheet correlations as estimates of consistency, the
stability coefficient is .500/.729 = .686.

¹ Incorrectly reported as .517 in Cureton (1965).

One other study, also previously reported (Cureton, 1965;
Loveland, 1952), included the Verbal Reasoning Test of the D.A.T.,
Form A. This test, along with all others of the D.A.T. except
Clerical Speed and Accuracy, was administered to 572 students
in grades 9 through 12 of one high school, and the same form was
re-administered eight days later. Odd and even items were scored
separately for each administration. We have the model,


                 Occasion
Form           I         II
Odd           x_1       x_3
Even          x_2       x_4


Here $r_{12}$ and $r_{34}$ are consistency coefficients, $r_{14}$ and $r_{23}$ are inter-form
correlations contaminated by differential practice effects, and
$r_{13}$ and $r_{24}$ are test-retest coefficients. The data are


C = \sqrt{r_{12} r_{34}} = .849
r = (r_{14} + r_{23})/2 = .805
r_s = r/C = .949
r_t = (r_{13} + r_{24})/2 = .846


The stability coefficient is appreciably higher than the value for
the A.C.E. Opposites Test, and is in fact higher than the mean value
for the Davis reading tests when the anomalous 1.067 is omitted,
even though the interval was eight days as against averages of
about five days and two days. How much of this may be attributed
to contamination due to differential practice effects and use of the
same form on both occasions is not clear. Note also that $r_t$, the
test-retest coefficient, is much lower than r. If we regard it as
an interform correlation, however, and divide it by C, the result is
.996, which is almost surely a gross overestimate of the stability.

These four studies deal with different types of tests, but all
of them, at least, are verbal tests. We may take the
intervals for the four studies as about one hour, two days, five
days, and eight days. For consecutive testing the instability is near
zero, for the two-day interval it is slight, for the five-day interval
it is substantial, and for the eight-day interval, due in some part
to contamination, it is again slight. We can conclude very tentatively
that for verbal tests the instability curve is of the inverted-S
type, with instability increasing slowly at first, then faster, and
finally more slowly again. All four of these studies probably represent
only a part of the interval over which instability is increasing
at an accelerated rate.


INTEGRATION OF CONCEPTS OF RELIABILITY
AND STANDARD ERROR OF MEASUREMENT 


JOHN L. HORN¹
University of Denver 


WHAT are the assumptions underlying derivations of various
indices of error of measurement and such coefficients of reliability
as KR-20 and KR-21? Over the years there has been a great
variety of instructive discussion relating to questions of this kind
(cf. Cronbach, 1951; Cronbach, Rajaratnam and Gleser, 1963;
Gulliksen, 1950; Henrysson, 1959; Hoyt, 1941; Kuder and Richardson,
1937; Lord 1955a; 1955b; 1957; 1959a; 1959b; 1962;
Novick, 1966; Novick and Lewis, 1967; Penfield, 1967; Winer,
1962 and others). Recent articles by Lord have been particularly
helpful in indicating the minimal assumptions under which a reliability
coefficient may be obtained and in showing the basis upon
which one would need to justify use of one standard error of
measurement (SEM) for all scores. Yet it is true that interesting
and important relationships between various derivations still remain
somewhat obscure and the implications in use of various
ways of estimating a standard error of measurement are by no
means evident. The purpose of this paper is to explicate some of
the problems implied by these statements and to indicate some of
the practical implications of various proposed solutions. This may
also point the way toward a clearer conceptualization of basic
theoretical issues in this area.


Random Response Models 


One of the more logically compelling ways of defining an SEM
and a reliability coefficient ($r_{tt}$) is in terms of a classical
(Fisherian) inferential model. Perhaps the first clear indication of this
kind of definition was given by Hoyt (1941), but Lord (1955a;
1955b; 1957), Penfield (1967) and Winer (1962) have also developed
the basic idea. It would appear, however, that we should
consider two somewhat different variations on this theme. These
correspond to the distinction between the alpha or KR-20 formula
for reliability and the formula referred to as KR-21. But the
implications may run deeper than this.

¹ I wish to thank Professors Bernard Spilka and [illegible] Humphreys and
[illegible] Betty Rossman for suggestions which are [illegible] several of the
ideas developed in this paper.

In the most direct approach to reliability via an inferential
model, the m items (stimuli) of a test (any measuring procedure)
are assumed to be a representative sample drawn from a population
of stimuli to which a subject could respond to indicate a
magnitude of the attribute in question. It is assumed (for the
purpose of rejecting the assumption) that the responses to stimuli
could be random. That is, the hypothesis which is to be rejected
stipulates that the m variables generated by N subjects responding
to m items are random variables. It need not be assumed that all
items have exactly the same difficulty level (eccentricity for
non-ability items), but the model implies that the variation in
eccentricity, as in other statistics for the distribution, will be only that
expected by chance. It is assumed that the scale value for a subject
will be obtained by simple linear combination of the response
scores for the individual items. It presents no theoretical problem
to divide scale values by a constant and it is convenient to do
this, making the constant m, the number of items. A scale value
can then be seen as a mean over m variables drawn representatively
from a population of such variables.

If an individual would respond to all stimuli in the population,
his true score, $\tau_i$, would be the mean in this population. It is
assumed that the mean, $t_i$, of responses to a sample of stimuli is
an unbiased estimate of $\tau_i$. In the random response model there is
an assumption that responses to items are drawn independently
(selection of one does not determine selection of another). Obviously,
however, this is a model assumption which an investigator would
hope to be able to reject. That is, he would expect that correlation
between items would be significantly larger than zero. If stimuli
are drawn from a population of stimuli responses to all of which
are indicative of the attribute to be measured, then the correlation
between such items will be positive and the random response model
will not adequately represent the behavior in question. An alternative
hypothesis can thus gain credence.

Several consequences follow from this conceptualization. First,
the variance of item responses around an obtained score, $t_i$, is
seen to be an efficient estimate of the sampling error (in the population
of stimuli) for measurements of value $t_i$. In other words,

$$\sigma_i^2 = \frac{\sum_{j=1}^{m} (X_{ij} - t_i)^2}{m - 1} \tag{1}$$

(where $X_{ij}$ represents the response of an individual i to stimulus
j) estimates a standard error of measurement for subjects obtaining
the score $t_i$. This implies that the SEM for different measurement
values will be (can be and, in general, is) different, as Lord (1957)
has emphasized. This idea also has empirical support. For example,
there are McNemar's (1942) well-known results showing that high
scores on the Stanford-Binet test have a larger error variability
than do the low scores.

Second, this statement of the problem leads directly to a test
of an hypothesis that measurement has been achieved. If measurement
were not achieved, then the implication is that response to
one stimulus in the set of m stimuli does not indicate presence of
the same attribute as is indicated by response to other stimuli in
the set. In other words, this implies that responses are indeed
random. If responses to the separate stimuli are random, then the
above-mentioned $\sigma_i^2$ for each subject is an estimate, based upon
a sample of size m, of the variance of a population of such random
responses. If it is assumed that sampling of subjects is independent
(the selection of one subject does not determine the selection of
another) and that the $\sigma_i^2$ estimates are homogeneous, then the
pooled variance

$$\sigma^2_{\text{pop-w}} = \frac{\sum_{i=1}^{N} \sigma_i^2}{N} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{m} (X_{ij} - t_i)^2}{N(m - 1)} \tag{2}$$

can be regarded as an efficient estimate of the variance
in the population. This is a variance estimate derived from variation
within subjects. The subscript pop-w in (2) represents this
fact.

Another estimate of the population variance can be obtained
from the marginal variance among $t_i$ values. For since any $t_i$ is
a mean for a sample of size m in the population,

$$\sigma_t^2 = \frac{\sum_{i=1}^{N} (t_i - \bar{t})^2}{N - 1} \tag{3}$$

is an estimate of the variance of means; and since the variance
of means and the variance of elements have the relationship,

$$\sigma^2_{\text{mean}} = \frac{\sigma^2_{\text{pop}}}{\text{sample size}} \tag{4}$$

(as in the usual developments of analysis of variance) it follows
that

$$\sigma^2_{\text{pop-b}} = m\,\sigma_t^2 \tag{5}$$

is an estimate of the population variance.
The estimates of (2) and (5) are independent. The ratio of
these independent estimates of a variance

$$F = \frac{\sigma^2_{\text{pop-b}}}{\sigma^2_{\text{pop-w}}} = \frac{m\,\sigma_t^2}{\sum_i \sum_j (X_{ij} - t_i)^2 / N(m - 1)} \tag{6}$$

has an F distribution with (N − 1) and N(m − 1) degrees
of freedom.
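
The two variance estimates and their ratio can be sketched numerically. The following is a minimal illustration under the definitions in (2), (3), (5) and (6); the function name `f_ratio` and the small 4 × 3 matrix of dichotomous responses are hypothetical and are not data from the paper.

```python
def f_ratio(scores):
    """Return (between estimate, within estimate, F) for an N x m
    matrix of item responses, following equations (2), (5) and (6)."""
    n = len(scores)          # number of subjects, N
    m = len(scores[0])       # number of items, m
    subject_means = [sum(row) / m for row in scores]   # the t_i values
    grand_mean = sum(subject_means) / n

    # Equation (2): pooled within-subject variance.
    ss_within = sum(
        (x - t_i) ** 2
        for row, t_i in zip(scores, subject_means)
        for x in row
    )
    var_within = ss_within / (n * (m - 1))

    # Equations (3) and (5): m times the variance of the subject means.
    var_means = sum((t_i - grand_mean) ** 2 for t_i in subject_means) / (n - 1)
    var_between = m * var_means

    return var_between, var_within, var_between / var_within


# Hypothetical responses of four subjects to three dichotomous items.
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
var_between, var_within, f_value = f_ratio(example_scores)
print(var_between, var_within, f_value)
```

The F value would then be referred to an F table with N − 1 and N(m − 1) degrees of freedom, here 3 and 8.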
These developments thus provide a basis for testing an hypothesis
counter to the hypothesis of random variation. If the F is
significant, the pertinent alternative hypothesis becomes tenable.
This stipulates that subjects' responses to the various stimuli are
not independent; that, in fact, response to one stimulus indicates
presence of the same attribute as is indicated by responses to other
stimuli. This is a basis, then, for rejecting a claim that the responses
do not indicate measurement.
When $\sigma^2_{\text{pop-b}}$ is appreciably larger than $\sigma^2_{\text{pop-w}}$ the hypothesis of
equality of these two variance estimates can be rejected. This will
occur when the covariance term, $\sum\sum r_{jk}\sigma_j\sigma_k$, in the general expression
for the variance of a linear composite variable

$$\sigma^2_{\text{composite}} = \sum_j \sigma_j^2 + \sum_{j \neq k}\sum r_{jk}\,\sigma_j\sigma_k \tag{7}$$

is appreciably larger than zero, so that the covariances will be nonzero. The
[words lost in the scan] implying that F will be large and
thus making it likely that a "no measurement" hypothesis can be
rejected and an alternative hypothesis (implying measurement)
can be accepted. It is on this basis, then, that Lord's development of
SEM and reliability theory can mesh with hypothesis testing.

Also, this development clearly indicates that when the sum of 
the covariances tends to zero, the variance of a composite tends 
to equal the sum of the variances of the components. Under these 
conditions if items are dichotomous, with about one-half of the 
subjects responding in each way to each item, the distribution of 
the composite scores will be a symmetric binomial that approxi- 
mates a normal distribution. On the other hand, as the sum of 
the covariances increases, the variance increases and the distribution 
form becomes more platykurtic, approaching U-shape as item 
intercorrelations approach 1.0. Clearly, then, an approximately nor- 
mal distribution can obtain under conditions of no measurement, 
whereas departures from normality indicate that one can retain an 
hypothesis that measurement has been achieved. 

Under the assumption of no measurement, the median of the F
statistic will be 1.0 or less. It is reasonable to argue that a descriptive
statistic which would indicate the extent to which the
"no-measurement" assumption does not fit with the data is one in
which an obtained value of F is compared with an expected value,
as by subtracting the median value from the obtained value,
F − 1. If this is large it indicates a departure from the
"no-measurement" conditions of the random-response model. However,
the value of F − 1 is partly a function of the number of degrees of
freedom and this would vary from one application to another. To
get a statistic that would not vary in this manner one could divide
out the degrees of freedom factor, thus

$$r_{tt} = \frac{F - 1}{F} \tag{8}$$

As demonstrated by Lord (1957), if an assumption of dichotomous
items is made, the reliability coefficient defined in this manner
is closely similar to the coefficient widely known as KR-21.
We can see this by substituting the expression for F into (8),
whence we obtain

$$r_{tt} = \frac{\sigma^2_{\text{pop-b}} - \sigma^2_{\text{pop-w}}}{\sigma^2_{\text{pop-b}}} = 1 - \frac{\sigma^2_{\text{pop-w}}}{\sigma^2_{\text{pop-b}}} \tag{9}$$

and then simplifying the within-variance term. To do this note
that

$$\frac{\sum_j (x_{ij} - t_i)^2}{m - 1} = \frac{\sum_j x_{ij}^2 - m t_i^2}{m - 1} = \frac{\sum_j x_{ij}^2 - m\left(\sum_j x_{ij}/m\right)^2}{m - 1} = \frac{m \sum_j x_{ij}^2 - \left(\sum_j x_{ij}\right)^2}{m(m - 1)} \tag{10}$$

where under the assumption that $x_{ij}$ is a dichotomous variable, it can
be assumed with no loss of generality that $\sum_j x_{ij}^2 = \sum_j x_{ij}$. Also
$\sum_j x_{ij}$ is simply the sum of the item responses, i.e., the total score,
which, for convenience, we may represent as $T_i$. Thus, the variance-within
can be seen to reduce to

$$\sigma^2_{\text{pop-w}} = \frac{\sum_i \left[ m \sum_j x_{ij} - \left(\sum_j x_{ij}\right)^2 \right]}{Nm(m - 1)} = \frac{\sum_i (m T_i - T_i^2)}{Nm(m - 1)} = \frac{m\bar{T} - S_T^2 - \bar{T}^2}{m(m - 1)} = \frac{\bar{T}(m - \bar{T}) - S_T^2}{m(m - 1)} \tag{11}$$

(where it is recognized that since $S_T^2 = \sum_i T_i^2/N - \bar{T}^2$ it follows
that $\sum_i T_i^2/N = S_T^2 + \bar{T}^2$).
The variance-between can be put into a somewhat similar form
by observing that the variance for the proportions in (5) is

$$\sigma_t^2 = \frac{\sum_i (t_i - \bar{t})^2}{N - 1} = \frac{1}{m^2}\,\frac{\sum_i (T_i - \bar{T})^2}{N - 1} = \frac{\sigma_T^2}{m^2} \tag{12}$$

Making these substitutions for $\sigma^2_{\text{pop-w}}$ and $\sigma^2_{\text{pop-b}}$ into (9) and
assuming that $\sigma_T^2 = S_T^2$, we obtain

$$r_{tt} = 1 - \frac{\bar{T}(m - \bar{T}) - S_T^2}{(m - 1) S_T^2} = \frac{m S_T^2 - \bar{T}(m - \bar{T})}{(m - 1) S_T^2} = \frac{m}{m - 1}\left[ 1 - \frac{\bar{T}(m - \bar{T})}{m S_T^2} \right] \tag{13}$$

two commonly seen expressions for formula KR-21.
To arrive at this computable representation of a concept of
reliability it was assumed that items were dichotomous. However,
this assumption was not contained in equation (9), from which the
KR-21 formula was derived, and this, also, was computable. Hence,
it can be argued that the basic assumptions involved in this definition
of reliability are essentially only those of linear analysis
of variance. In the structural model underlying this model there
was an assumption that an obtained response was comprised of
two components (the true and the error components) which combined
additively. In the statement that the variance estimates are
independent there was contained an assertion that the expected
covariance for the error and true components was zero. The variance
between persons could then be seen to contain the variance
due to the true score plus error variance, while the pooled within-person
variance could be seen to be comprised only of this latter.
However, it was not necessary to assume that the variances for
items were equal, as sometimes seems to be supposed in discussions
of the rationale for the KR-21 formula. Rather, as noted, the
assumption of this kind was merely one of homogeneity of item
variances. In other words, the KR-21 formula need not involve
an assumption that item difficulties are precisely equal; although
it can seem to involve this assumption when it is derived algebraically
from the KR-20 formula (see Horst, 1966 for a recent
example of this derivation).
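
The equivalence of the variance-ratio definition in (9) and the moment form in (13) can be checked numerically for dichotomous items. The sketch below is an illustration only; the function names and the small score matrix are hypothetical, and the total-score variance is computed with divisor N throughout, which corresponds to the stated assumption that $\sigma_T^2 = S_T^2$.

```python
def kr21_from_moments(scores):
    """KR-21 from the mean and variance of total scores, last form of (13)."""
    n = len(scores)
    m = len(scores[0])
    totals = [sum(row) for row in scores]              # T_i
    mean_total = sum(totals) / n                       # T-bar
    var_total = sum((t - mean_total) ** 2 for t in totals) / n   # S_T^2
    return (m / (m - 1)) * (1 - mean_total * (m - mean_total) / (m * var_total))


def kr21_from_variances(scores):
    """Same coefficient as 1 - within/between, equation (9),
    taking the between estimate as S_T^2 / m."""
    n = len(scores)
    m = len(scores[0])
    ss_within = sum((x - sum(row) / m) ** 2 for row in scores for x in row)
    var_within = ss_within / (n * (m - 1))             # equation (2)
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    var_between = var_total / m
    return 1 - var_within / var_between


# Hypothetical dichotomous responses: four subjects, three items.
kr21_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
kr21_a = kr21_from_moments(kr21_scores)
kr21_b = kr21_from_variances(kr21_scores)
print(kr21_a, kr21_b)
```

The two routes agree only because every response is 0 or 1; with graded item scores the step $\sum x_{ij}^2 = \sum x_{ij}$ used in (10) would no longer hold.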

Hoyt's (1941) well known statement of a concept of reliability derives
from a somewhat different set of assumptions than those stated
above. In this statement of the problem Hoyt reversed the roles, so
to speak, in what is commonly referred to as a single-factor
repeated-measures analysis of variance design. That is, instead of
regarding the items of a test as treatments, whence it becomes reasonable
to use the variance of the residual ($\sigma^2_{\text{res}}$), what is sometimes
referred to as interaction, as the variance-estimate term
(error) in testing a main effect due to items, Hoyt treated the subjects
as treatments and then went on to use $\sigma^2_{\text{res}}$ as the variance-estimate
term in a test for a main effect due to subjects. This leads
to a statement of reliability which is formally very similar to that
of equation (9), namely

$$r_{tt} = \frac{F - 1}{F} = 1 - \frac{\sigma^2_{\text{res}}}{\sigma^2_{\text{pop-b}}} \tag{14}$$

but which, in fact, is different in some interesting, if not practically
important, ways.

In particular this statement of reliability has the effect of removing
variance resulting from differences in item difficulties. At
a practical level this means that the reliability calculated by this
procedure usually will be larger than the reliability calculated by
the procedure of equation (9).

This can be seen by recalling that equation (9) involves a
within-person sum of squares which can be partitioned into a term
representing the sum of squares of the residual plus a term, usually
not zero, representing the between-item variability in eccentricity
(difficulty). That is, as is shown in detail by Scheffé (1960)

$$\sum_{i=1}^{N}\sum_{j=1}^{m} (x_{ij} - t_i)^2 = \sum_{i=1}^{N}\sum_{j=1}^{m} (x_{ij} - t_i - M_j + \bar{t})^2 + N\sum_{j=1}^{m} (M_j - \bar{t})^2 \tag{15}$$

SS(Within subjects) = SS(Residual) + SS(Between Items)

$$SS_w = SS_{\text{res}} + SS_M$$

where $M_j$ represents the mean (over N) for item (treatment) j,
SS stands for sum of squares, and the meanings of the other
symbols are as defined previously. The N(m − 1) degrees of
freedom associated with the within-subjects sum of squares is
similarly partitioned into (m − 1) for the between-items term
and (m − 1)(N − 1) for the residual. Thus, solving (15) for
$SS_{\text{res}}$, the variance estimate based upon the residual can be seen
to be

$$\sigma^2_{\text{res}} = \frac{SS_w - SS_M}{(m - 1)(N - 1)} \tag{16}$$

In the symbols of this equation $\sigma^2_{\text{pop-w}}$ of equation (9) is

$$\sigma^2_{\text{pop-w}} = \frac{SS_w}{N(m - 1)} \tag{17}$$

For purposes of comparison it is convenient to multiply on both
sides of (16) by (N − 1) and on both sides of (17) by N and then
solve for $\sigma^2_{\text{res}}$, thus

$$(N - 1)\,\sigma^2_{\text{res}} = \frac{SS_w - SS_M}{m - 1}, \qquad N\sigma^2_{\text{pop-w}} = \frac{SS_w}{m - 1}, \qquad \sigma^2_{\text{res}} = \frac{N}{N - 1}\,\sigma^2_{\text{pop-w}} - \frac{SS_M}{(m - 1)(N - 1)} \tag{18}$$

This indicates clearly that it is possible (with small N, thus a
large N/(N − 1) ratio, and small sum of squares due to differences
in item difficulties) for the variance of the residual to be larger
than the within-person variance. In fact, if item difficulties were
precisely equal (implying that $SS_M$ is zero), then $\sigma^2_{\text{res}}$ must be
larger than $\sigma^2_{\text{pop-w}}$ by the ratio N/(N − 1). More often in the practice
of research N will be large, the ratio of N to (N − 1) will be
very nearly 1.0, some variance in item difficulties will obtain, thus
$\sigma^2_{\text{res}}$ will be smaller than $\sigma^2_{\text{pop-w}}$ and the reliability computed by
(14) will be larger than the reliability computed by equation (9).

After defining reliability in a manner equivalent to equation
(14), Hoyt observed that: "It may be interesting to some who are
familiar with the work of Kuder and Richardson that the foregoing
method of estimating the coefficient of reliability gives precisely
the same results as formula (20) of their paper. This fact can
be easily verified algebraically."

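The algebra to which Hoyt refers, although a bit more tedious
than one might hope, can be indicated by first recalling that the
variance of the residual reduces to

$$\sigma^2_{\text{res}} = \frac{\sum_j \sigma_j^2}{m} - \frac{\sum\sum_{j \neq k} r_{jk}\sigma_j\sigma_k}{m(m - 1)} \tag{19}$$

(Winer, 1962; pp. 120-122). Then notice from equations (15), (7)
and (12) that the variance between subjects can be expressed as

$$\sigma^2_{\text{pop-b}} = \frac{\sum_j \sigma_j^2 + \sum\sum_{j \neq k} r_{jk}\sigma_j\sigma_k}{m} = \frac{S_T^2}{m} \tag{20}$$

With these observations the reliability of equation (14) can be
rather directly reduced to

$$r_{tt} = 1 - \frac{\dfrac{\sum_j \sigma_j^2}{m} - \dfrac{\sum\sum r_{jk}\sigma_j\sigma_k}{m(m - 1)}}{\dfrac{\sum_j \sigma_j^2 + \sum\sum r_{jk}\sigma_j\sigma_k}{m}}$$

$$= \frac{(m - 1)\left(\sum_j \sigma_j^2 + \sum\sum r_{jk}\sigma_j\sigma_k\right) - (m - 1)\sum_j \sigma_j^2 + \sum\sum r_{jk}\sigma_j\sigma_k}{(m - 1) S_T^2}$$

$$= \frac{(m - 1)\sum\sum r_{jk}\sigma_j\sigma_k + \sum\sum r_{jk}\sigma_j\sigma_k}{(m - 1) S_T^2} = \frac{m}{m - 1}\,\frac{\sum\sum_{j \neq k} r_{jk}\sigma_j\sigma_k}{S_T^2} \tag{21}$$

one of the familiar forms of the KR-20 or alpha formula for
reliability.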
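
Hoyt's observation can also be checked numerically. The sketch below is illustrative only and uses hypothetical function names and a hypothetical score matrix. It computes the analysis-of-variance form of (14) with the residual mean square, and the familiar alpha form with item and total variances; both variances in the second function use divisor N, which is an assumption of this sketch (any common divisor cancels in the ratio).

```python
def hoyt_reliability(scores):
    """Equation (14): 1 - (residual mean square) / (between-subjects mean square)."""
    n = len(scores)
    m = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * m)
    subject_means = [sum(row) / m for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(m)]

    ss_subjects = m * sum((t - grand) ** 2 for t in subject_means)
    ss_residual = sum(
        (scores[i][j] - subject_means[i] - item_means[j] + grand) ** 2
        for i in range(n)
        for j in range(m)
    )
    ms_subjects = ss_subjects / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (m - 1))
    return 1 - ms_residual / ms_subjects


def kr20(scores):
    """Alpha / KR-20 in the form m/(m-1) * (S_T^2 - sum of item variances) / S_T^2."""
    n = len(scores)
    m = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    total_var = variance([sum(row) for row in scores])
    item_vars = [variance([row[j] for row in scores]) for j in range(m)]
    return (m / (m - 1)) * (total_var - sum(item_vars)) / total_var


# Hypothetical dichotomous responses with unequal item difficulties.
hoyt_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
r_hoyt = hoyt_reliability(hoyt_scores)
r_kr20 = kr20(hoyt_scores)
print(r_hoyt, r_kr20)
```

For this small matrix the item difficulties differ (.75, .50, .25), which is the situation in which the coefficient based on the residual exceeds the one based on the pooled within-person variance.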

A conclusion to be drawn from these developments is that
usually (i.e., when N/(N − 1) is nearly 1.0 and item difficulties
differ) KR-20 should yield a somewhat larger estimate of reliability
than KR-21. Actually, in many practical situations the
difference between the two will be small, of the order of .05 or
less. However, the fact of the difference between the two should be
kept in mind when considering the standard error of measurement
formulae developed in the next section.


Standard Error of Measurement Models 


To arrive at an estimate of a standard error of measurement, one
may solve equation (9) for $\sigma^2_{\text{pop-w}}$ in terms of $r_{tt}$, thus,

$$SEM(t) = \sqrt{\sigma^2_{\text{pop-w}}} = S_t\sqrt{1 - r_{tt}} \tag{22}$$

where, for convenience in further derivations, the small t may now
be regarded as expressed in deviation-score form.² In this SEM
one is, in effect, pooling the individual standard errors of measurement,
$\sigma_i$, described in equation (1). Here, then, is an assumption
of homogeneity of error variances. This is the same assumption
as is involved in the SEM derivations based upon a consideration
of the correlation between hypothetically parallel tests, where, however,
it would be recognized as an assumption of homoscedasticity.
The SEM in equation (22) is the same as that which is obtained
by use of what is called the true-score substitution model
(Gulliksen, 1950). In this the obtained score is not regressed to
give an estimate of the true score. The model for estimation is
simply

$$\hat{\tau}_i = t_i \tag{23}$$

That is, the true score is estimated as the obtained score and the
standard deviation for repeated observations of $t_i$ is the SEM
in (22).

² Also [words illegible] t is still regarded as the total score (over m items)
divided by [illegible].
This model is different from a fallible-score substitution model
in which an obtained score is regarded as an estimate of another
obtained score. That is, in setting up confidence bounds around a
given $t_i$ score, the assumption (perhaps implicit) is that this score
is to be substituted for others that might be obtained in another
measurement of the same type. The estimation in this case can
be symbolized as

$$\hat{t}_2 = t_1 \tag{24}$$

where 1 and 2 indicate the first and second occasions respectively.
In this case the standard error for repeated observation is assumed
to be the standard deviation of the difference between two
obtained scores. Under the assumption that sigmas for separate
tests (occasions) are equal (and equal to $\sigma_t$), this reduces thus

$$SEM(dif) = \sqrt{\sigma_1^2 + \sigma_2^2 - 2 r_{12}\sigma_1\sigma_2} = \sqrt{\sigma_t^2 + \sigma_t^2 - 2 r_{12}\sigma_t^2} = \sigma_t\sqrt{2(1 - r_{12})} \tag{25}$$
The r in this SEM is often regarded as different from the $r_{tt}$ in
equation (22), but the two can be seen to be the same in derivation
and in most situations where an SEM is used. The $r_{12}$ is the correlation
between two fallible forms of a test. The $r_{tt}$, on the other hand,
usually is viewed as the correlation between a given test and a hypothetical
test just like it, as in the most widely used derivation of
the KR-20 formula. That is, $r_{tt}$ is defined as the correlation between
two linear composites of the form

$$t = (x_1 + x_2 + \cdots + x_m), \qquad t' = (x_{1'} + x_{2'} + \cdots + x_{m'}) \tag{26}$$

where $t'$ is a hypothetical parallel measurement of $t$. Then

$$r_{tt'} = \frac{\sum_j \sum_k r_{jk'} S_j S_{k'}}{\sqrt{\sum_j S_j^2 + \sum\sum_{j \neq k} r_{jk} S_j S_k}\;\sqrt{\sum_j S_{j'}^2 + \sum\sum_{j \neq k} r_{j'k'} S_{j'} S_{k'}}} \tag{27}$$

Stated in this way the problem is that the components involving
primes are hypothetical and so (27) cannot be computed directly.
However, since it is assumed that the item variances and covariances
in the hypothetical test are of the same magnitude as in the
obtained test, the second term in the denominator (representing
the variance of the hypothetical test) can be regarded as equal to
the first term in the denominator and the average covariance between
obtained and hypothetical components can be regarded as
equal to the average covariance among the obtained components.
That is, it can be assumed that

$$\sum_j S_{j'}^2 + \sum\sum_{j \neq k} r_{j'k'} S_{j'} S_{k'} = \sum_j S_j^2 + \sum\sum_{j \neq k} r_{jk} S_j S_k \tag{28}$$

and

$$\frac{\sum_j \sum_k r_{jk'} S_j S_{k'}}{m^2} = \frac{\sum\sum_{j \neq k} r_{jk} S_j S_k}{m(m - 1)} \tag{29}$$

whence (28) and (29) can be solved for the hypothetical components
on the left in terms of the obtainable components on the
right and the results can be substituted into (27), to give the
KR-20 formula derived in equation (21). By adding and subtracting
$\sum_j S_j^2$ in the numerator, this expression is put in the more
common form

$$r_{tt} = \frac{m}{m - 1}\,\frac{S_T^2 - \sum_j S_j^2}{S_T^2} \tag{30}$$

If the assumption is made that items are dichotomous, so that
$S_j^2$ can be replaced by $p_j(1 - p_j)$, where $p_j$ is the eccentricity of
item j, and it is further assumed that all $p_j$ are equal, then KR-21
can be derived directly from (30). As noted earlier, however, it is
not necessary to derive KR-21 in this way.
If, moreover, the standard deviations for the components in (21)
are assumed to be equal and $\sum\sum r_{jk}$ is replaced by $m(m - 1)\bar{r}_{jk}$,
where $\bar{r}_{jk}$ is an estimate of the "typical" correlation among components
(perhaps the average), then (21) reduces to the general
Spearman-Brown formula

$$r_{tt} = \frac{m\,\bar{r}_{jk}}{1 + (m - 1)\,\bar{r}_{jk}} \tag{31}$$

as is well known. But now given that one knows the correlation
between two fallible tests and this is the $r_{12}$ given in equation (25)
for the SEM, then this may be looked upon as one kind of $\bar{r}_{jk}$ estimate
to use in equation (31). Moreover, since the standard error in this
case is to be estimated on, and applied to, just one of the two forms,
the $r_{tt}$ being calculated is for a test that is just one times as long
as the tests that were correlated. Thus in this case m in (31) is 1 and

$$r_{tt} = \frac{(1)\,r_{12}}{1 + (1 - 1)\,r_{12}} = r_{12}$$
Under these assumptions, then, the true-score and fallible-score
substitution models and the corresponding SEM(t) and SEM(dif)
may be said to involve the same summary statistics ($\sigma$ and r)
and can thus be compared as to their implications in use.

In contrast to the substitution models are models in which the 
estimated score is obtained as a linear regression from the ob- 
tained score. The fallible-score regression model is the simplest of 
these. This is the usual fixed-variate regression model. One variable 
is estimated from another using the equation 


(32) 


Tu 


ї = s (33) 
Whence the standard error of estimate

$$\sigma(t_2 - \hat{t}_2) = SEM(fr) = \sigma_t\sqrt{1 - r_{12}^2} \tag{34}$$

(fr stands for "fallible regression") is then treated as an SEM.
The $r_{12}$ in this case has the same meaning as the $r_{12}$ introduced
in the discussion of the fallible-score substitution model. Thus this
SEM can be compared with the SEM of equation (22).

In a true-score regression model for error of measurement, the
estimation equation is

$$\hat{\tau} = r_{t\tau}\,\frac{\sigma_\tau}{\sigma_t}\,t \tag{35}$$

and the SEM is again, as in the case of the fallible-score regression
model, the standard error of estimate

$$SEM(tr) = \sigma_\tau\sqrt{1 - r_{t\tau}^2} \tag{36}$$

(tr stands for true regression). But here the values involving $\tau$,
the true score, are hypothetical. Thus it is necessary to work out
the computable form of these.

Using the assumptions of the classical true-score and error-score
derivations, $r_{t\tau}^2$ can be seen to be equal to $r_{tt}$, as defined above, and
the SEM can be seen to involve the same summary statistics as are

T EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


contained in the other SEM's. That is, the fallible score, t, is as- 
sumed to be a simple linear combination of a true-score component 
and an error-score component. Then the correlation r, is derived as 

het ide (r + dr в, renes, n 37) 

" New, Nev, сүс, в, я 

where it is assumed that fre the correlation between true-score and 
error-score components, is zero. Next the correlation between two 
fallible seta of measurements involving the same true-score com- 
ponent is obtained as an empirical estimate of reliability 


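$$r_{tt} = \frac{\sigma_\tau^2}{\sigma_t^2} \tag{38}$$

where it is assumed that $r_{\tau e}$, $r_{\tau e'}$, and $r_{ee'}$ are zero and $\sigma_t = \sigma_{t'}$. Given
these results, it is evident that $r_{tt}$ in (38) is equal to $r_{t\tau}^2$, the square
of the value in (37), so $r_{t\tau} = \sqrt{r_{tt}}$. Also, (37) or (38) can be solved
for $\sigma_\tau$ in terms of observables

$$\sigma_\tau = \sigma_t r_{t\tau} = \sigma_t\sqrt{r_{tt}} \tag{39}$$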
When these values for the $r_{t\tau}$ and $\sigma_\tau$ are substituted into equation
(35), the estimation equation under the assumptions of the true-score
regression model is seen to be

$$\hat{\tau} = \sqrt{r_{tt}}\,\frac{\sigma_t\sqrt{r_{tt}}}{\sigma_t}\,(t - M_t) + M_t = r_{tt}(t - M_t) + M_t \tag{40}$$

(where it is assumed that the mean of the true scores equals the
mean of the fallible scores). When the similar substitutions are
made into equation (36), the standard error of measurement for
the estimated true scores is seen to be

$$\mathrm{SEM}(tr) = \sigma_t\sqrt{r_{tt}}\sqrt{1 - r_{tt}} \tag{41}$$
These developments provide a rather clear basis for comparisons
of the different estimates of a standard error of measurement and
associated confidence intervals for a given score. For they show
that the same $\sigma_t$ and $r_{tt}$ would enter into the calculation of
each of the four SEMs.

Comparisons of the values of SEMs for a representative set of
values of $r_{tt}$ are shown in Table 1 for the standard score case
($\sigma_t$ = 1.0). Here it can be seen that the order from smallest
to largest SEM is

SEM(tr) < SEM(t) < SEM(fr) < SEM(dif)

and that SEM(tr) first increases and then decreases with increase
in $r_{tt}$ from .00 to 1.00, whereas all other SEMs decrease regularly
with increase in $r_{tt}$.

The practical implications of these developments can be seen
more concretely by considering the estimated scores and confidence
boundaries that would be obtained for a given score when using a

TABLE 1
Comparisons of SEMs and Confidence Intervals

Square of SEM Values for Different $r_{tt}$ Values

SEM(tr)            SEM(t)          SEM(fr)           SEM(dif)
True-Score         True-Score      Fallible-Score    Fallible-Score
Regression         Substitution    Regression        Substitution
r_tt(1 - r_tt)     (1 - r_tt)      (1 - r_tt^2)      2(1 - r_tt)
  .00                .00             .00               .00
  .09                .10             .19               .20
  .16                .20             .36               .40
  .21                .30             .51               .60
  .24                .40             .64               .80
  .25                .50             .75              1.00
  .24                .60             .84              1.20
  .21                .70             .91              1.40
  .16                .80             .96              1.60
  .09                .90             .99              1.80
  .00               1.00            1.00              2.00

SEMs and Corresponding Confidence Intervals when
$r_{tt}$ is .91 and $\sigma_t$ is 15

                                        95% Confidence Bounds
Model        Obtained    Estimated      ± 2 SEM
SEM(tr)        130         127.3        118.7 to 135.9
SEM(t)         130         130.0        121.0 to 139.0
SEM(fr)        130         127.3        114.9 to 139.7
SEM(dif)       130         130.0        117.2 to 142.8

well known test. At the foot of Table 1 is shown a comparison of
the intervals estimated for an IQ score of 130 obtained with a
test having a reliability of .91 and a standard deviation of 15
IQ points. It can be seen that an applied worker, such as a teacher,
might get rather different ideas about a person's IQ depending upon
which information about confidence bounds was provided.
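
The bounds at the foot of Table 1 can be recomputed with a short Python sketch. It assumes an IQ scale mean of 100, which is implied by the estimated score of 127.3 but not stated in the text; the function names are illustrative only.

```python
import math

def sem_values(sigma, r):
    """Four standard errors of measurement for SD `sigma` and reliability `r`."""
    return {
        "tr": sigma * math.sqrt(r * (1.0 - r)),     # true-score regression
        "t": sigma * math.sqrt(1.0 - r),            # true-score substitution
        "fr": sigma * math.sqrt(1.0 - r * r),       # fallible-score regression
        "dif": sigma * math.sqrt(2.0 * (1.0 - r)),  # fallible-score substitution
    }

def confidence_bounds(obtained, mean, sigma, r, width=2.0):
    """Return (low, high) bounds of +/- `width` SEM for each model."""
    sems = sem_values(sigma, r)
    regressed = mean + r * (obtained - mean)  # regression estimate of the score
    # Regression models centre on the regressed estimate,
    # substitution models on the obtained score itself.
    centers = {"tr": regressed, "t": obtained, "fr": regressed, "dif": obtained}
    return {k: (centers[k] - width * sems[k], centers[k] + width * sems[k])
            for k in sems}

# Assumed scale mean of 100; r = .91 and sigma = 15 are taken from Table 1.
bounds = confidence_bounds(obtained=130, mean=100, sigma=15, r=0.91)
for model, (low, high) in bounds.items():
    print(f"SEM({model}): {low:.1f} to {high:.1f}")
```

The same summary statistics enter all four computations; only the centre of the interval and the multiplier of $\sigma_t$ change from model to model.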

Of importance in the present context is the suggestion that dif- 
ferent conclusions about the boundaries for an obtained score derive 
from the same computed summary statistics. This is contrary to 
what frequently seems to be assumed. In particular it seems to be 
assumed rather frequently that SEM(fr) and SEM(dif), in contrast
to SEM(tr) and SEM(t), would be used under quite different
conditions, involving quite different r coefficients. Here the
suggestion is that the differences are apparent but not real and
that what should dictate choice of one or the other SEM are the
factors which would lead one to prefer one or the other model for
variability of an obtained measurement. It does not follow that
because one has a correlation, $r_{tt}$, between two equivalent tests,
he must use either SEM (fr) or SEM (dif), and that because he 
has a KR-20 or KR-21 coefficient, he must use either SEM (tr) or 
SEM (t). Given either of these kinds of coefficients, an investigator 
might use any one of the four SEM models. 

It should be emphasized also that because the models for reliability
(outlined in the first section of this article) lead to somewhat
different values for an $r_{tt}$, still another source of difference
in setting confidence bounds is introduced. On the same data, for
example, the $r_{tt}$ computed by means of equation (9) (the KR-21
Lord model) leads to a value of .90, whereas the $r_{tt}$ computed by
means of equation (21) (the KR-20 Hoyt model) gives a value of
.94. These differences will produce differences in any SEM computed
on the basis of an estimate of $r_{tt}$.

These technical issues concerning the way error is to be estimated
for a given set of observations do not in any way dispute the
fact that different kinds of variability can be represented as
“error.”³ Variability between scores obtained with tests that can
be regarded as replacements for each other (as in computing
equivalency reliability), variability between measurements obtained
at quite different times (as in computing stability reliability),
variability among responses obtained on a single occasion
(as in obtaining internal consistency reliability) and variability
between observers (as in computing conspect or inter-rater reliability)
may be designated as “error” in any one of the formulae for
reliability or SEM. The concepts of error represented by these ways
of regarding “variability out of place” have no necessary relationship
to the mathematical theory outlined here. The mathematical
theory may be thought of as a metatheory which can be employed
in conjunction with any one of the four (and perhaps other) theories
stipulating the kind of behavioral variability that is regarded
as error. For example, whether the basic observations are
test-retest scores obtained over a ten year interval or separate
measures of thirst obtained on a given occasion, one can use either of the
two r's or any one of the four SEMs discussed in this paper.

³ As the botanist regards a weed as “a flower out of place,” so the
psychometrist can regard error as “variation out of place,” thus emphasizing
that what is regarded as error at one time might not be so regarded at
another time.

IN many areas of inquiry, the variable of interest often does
not lend itself to direct physical measurement; its mensuration,
therefore, must rely upon the subjective judgments of the observer.
For example, the evaluation of works in the arts and letters, the
hing of certain experimental and clinical data in biomedical 
research, psychology, sociology, and education are all but a few of


d categories. Suppose we imagine a situation involving n
subjects, m judges, and t categories. Each of the n subjects is
examined and assigned into one of the t categories independently by
each of the m judges. Our task is to devise a metric to measure the
intensity of agreement among the m judges.


amination, however, one recognizes that the chi square test will 
t suffice because (1) а significant chi square value simply signifies 
some or all of the m judges are si 
other, but says nothing about the agreement т 
1 insignificant chi square merely denotes the lack of evidence 


against the hypothesis that the relative frequencies of the m judges
with respect to the t categories are the same; it again says nothing
about how closely they resemble each other, namely, the intensity
of agreement. To take another view, the above mentioned situation
may be viewed as a multi-judge rank correlation problem which
allows ties in the t ranks. The rank correlation viewpoint is
theoretically acceptable, but in actual application the frequent
occurrence of tied ranks renders the resultant rank correlation
cumbersome to calculate and somewhat powerless in meaning (Kendall
1955).

The purposes of this paper are twofold: (1) to describe a
weighting procedure for the categories, i.e., to assign a value to each
category based on a transformation from the data's own distribution,
and (2) to devise a coefficient of agreement of the judges
calculated from the transforms of the categorical data.


Theoretical Consideration 


Let a set of n subjects (or products) be judged by a set of m judges
with respect to some attribute X, according to some descriptive
criterion. Assume that attribute X is conceptually measurable on a
continuous scale, but due to practical limitations, direct physical
measurement is not possible; instead, it can only be rated in
terms of an ordered set of nonoverlapping categories as described by
the criterion. The set of categories may consist of, say, t categories,
$C_1, C_2, \ldots, C_k, \ldots, C_t$, such that $X_i < X_j$ if $X_i \in C_a$, $X_j \in C_b$ and
$C_a < C_b$ in terms of the continuum of X. The observed results of
the above may be represented by the following array:


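$$
\begin{array}{c|cccccc}
 & J_1 & J_2 & \cdots & J_j & \cdots & J_m \\ \hline
S_1 & X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1m} \\
S_2 & X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2m} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
S_i & X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{im} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
S_n & X_{n1} & X_{n2} & \cdots & X_{nj} & \cdots & X_{nm}
\end{array}
$$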


where $X_{ij}$ is the judgment of the ith subject by the jth judge
and may assume the “values” of $C_k$ (i = 1, 2, ..., n; j = 1, 2, ..., m;
and k = 1, 2, ..., t). In order to devise a metric to measure the
“degree” of agreement among the m judges, it is necessary that we
now delineate the concept of agreement and define the meaning of
its measure.

The simplest case of agreement may be illustrated by the case
where two judges place a single subject into one of two categories
A and B, such that A < B, say. Clearly the events AA and
BB constitute agreement and the events AB and BA disagreement.
However, when there are more than two judges and two
categories involved, the demarcations between agreement and
disagreement are no longer so clear-cut because the term agreement
begins to assume a meaning of gradation in a quantitative sense.
Suppose there are three judges placing a subject into one of three
categories A, B, and C, where A < B < C. Obviously, the events
AAA, BBB, and CCC still constitute agreement. But AAB is
certainly in closer agreement than AAC, because B is closer to A
than C. However, can we assume that AAB is in the same degree of
agreement as BBC? Is ABC in worse agreement than AAC? In view
of these unsettling points, what then constitutes the lower end of
agreement? The situation becomes more critical when there are
more judges than categories (i.e., m > t), because then agreement
among certain judges is inevitable. It is clear then that while we
have no problem in defining the maximum agreement among m
judges in terms of t categories (m, t > 2), we are facing a dilemma
in defining the other extreme of the agreement scale, namely, the
disagreement, or the minimum agreement. In order to bring the above
dilemma into some manageable fashion, it becomes necessary to
define the minimum agreement in terms of some clearly defined
criteria.

While there are many different ways to define the minimum
agreement, we shall choose only two of them; namely, (1) the
maximum within subject variance, and (2) the maximum within
subject entropy. As the names imply, the first measure is a metric
of agreement from the analysis of variance point of view and the
second is from the information theory point of view. Suppose
that appropriate weights $y_k$ may be assigned to categories $C_k$
respectively. Let $S_i^2$ = variance of ratings of the ith subject; we have

$$S_i^2 = \frac{\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2}{m - 1}$$

and let $H_i$ = entropy of ratings of the ith subject,

$$H_i = -\sum_{k=1}^{t} P_{ik}\log P_{ik}, \qquad P_{ik} = \frac{\text{no. of judges who rated the ith subject as } C_k}{m}$$

(Lu 1968).


We see in the case of perfect agreement, all m ratings are in a
particular $C_k$, hence all have identical $y_{ik}$, then

$$S_i^2 = H_i = 0.$$

The maximum $S_i^2$ will occur if the m ratings are distributed $(m/2)\,C_1$
and $(m/2)\,C_t$, because since

$$\bar{y}_i = \frac{\frac{m}{2}y_1 + \frac{m}{2}y_t}{m} = \frac{y_1 + y_t}{2},$$

$$S_i^2 = \frac{\frac{m}{2}(y_1 - \bar{y}_i)^2 + \frac{m}{2}(y_t - \bar{y}_i)^2}{m - 1} = \frac{m}{m - 1}(y_1 - \bar{y}_i)^2$$

is a maximum since $(y_1 - \bar{y}_i)^2 > (y_r - \bar{y}_i)^2$ for r > 1 and r < t.
From the analysis of variance point of view this is the minimum
agreement, i.e., the m judges have divided into two camps of opinion
as far apart as possible. The maximum entropy will occur
if $P_{ik} = 1/t$ for all k. In other words, the m judges assign any one of
the t ratings to the ith subject with equal likelihood. From this
point of view, this is the worst the judges can do; any agreement
among the judges is then at random, is therefore not meaningful,
and consequently represents the minimum agreement.

It is now abundantly clear that the minimum agreement as
defined by the maximum variance $S_i^2$ is not the minimum agreement
defined by the maximum entropy $H_i$. To put it in words, while the
distribution

$$\tfrac{m}{2}\,C_1 \quad\text{and}\quad \tfrac{m}{2}\,C_t$$

may have a maximum variance, its entropy $H_i$ is certainly not a
maximum. A fact easily discernible is that there is perfect agreement
among each set of m/2 judges. On the other hand, the random
assignment of ratings to the ith subject may yield a maximum
entropy $H_i$, but the within subject variance is certainly not the
largest.

Thus, the variance measure $S_i^2$ cannot account for the agreement
within sets of agreeing judges, and the entropy measure cannot
distinguish between magnitudes of agreement. For instance,
suppose we have two situations, where the m ratings are

(1) $\tfrac{m}{2}C_1$ and $\tfrac{m}{2}C_t$, and (2) $\tfrac{m}{2}C_{t-1}$ and $\tfrac{m}{2}C_t$;

since $y_1 < y_2 < \cdots < y_{t-1} < y_t$, the variance measure will show the
first case having less agreement than the second, but the entropy
measures of the two cases will be identical. It would be desirable
for the devised measure of agreement to utilize both the variance
and the information viewpoints.
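
The contrast between the two criteria can be checked numerically. The sketch below is only illustrative: the four category weights and the eight-judge rating patterns are hypothetical values chosen for the example, not data from the paper.

```python
import math
from collections import Counter

# Hypothetical weights y_k for t = 4 ordered categories (assumed values).
WEIGHTS = {1: 0.10, 2: 0.35, 3: 0.65, 4: 0.90}

def within_variance(ratings):
    """Within subject variance S_i^2 of the weights attached to the ratings."""
    y = [WEIGHTS[c] for c in ratings]
    m = len(y)
    mean = sum(y) / m
    return sum((v - mean) ** 2 for v in y) / (m - 1)

def entropy(ratings):
    """Entropy H_i = -sum P_ik log P_ik over the categories actually used."""
    m = len(ratings)
    return -sum((n / m) * math.log(n / m) for n in Counter(ratings).values())

# Rating patterns for one subject judged by m = 8 judges.
cases = {
    "perfect": [2] * 8,                       # all judges agree
    "extreme_split": [1] * 4 + [4] * 4,       # (m/2) C_1 and (m/2) C_t
    "adjacent_split": [3] * 4 + [4] * 4,      # (m/2) C_{t-1} and (m/2) C_t
    "uniform": [1, 1, 2, 2, 3, 3, 4, 4],      # every category equally used
}
summary = {name: (within_variance(r), entropy(r)) for name, r in cases.items()}
for name, (s2, h) in summary.items():
    print(f"{name:15s} S^2 = {s2:.4f}  H = {h:.4f}")
```

In this example the extreme split has the largest variance but not the largest entropy, while the two splits share the same entropy despite very different variances.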

Since in reality one seldom has maximum or minimum agreement,
agreement, if present and detectable at all, must lie
somewhere between the maximum and the random agreement. We
shall, therefore, use the within variance under conditions of
maximum entropy. Thus, the coefficient of agreement, A, may be
defined as follows:

$$A = \frac{\sigma_H^2 - S_w^2}{\sigma_H^2} \tag{1}$$

where
$S_w^2$ = the observed within subject variance,
$\sigma_H^2$ = expected within subject variance under conditions of
maximum entropy, i.e., all ratings are equally likely,

$$\Pr\{X \in C_k\} = \frac{1}{t}, \qquad k = 1, 2, \ldots, t.$$

We see that

$$A \to 0 \ \text{as}\ S_w^2 \to \sigma_H^2, \qquad A \to 1 \ \text{as}\ S_w^2 \to 0.$$


A can never be of indeterminate form because $\sigma_H^2 > 0$. One point
that needs to be stressed here is that A < 0 when $S_w^2 > \sigma_H^2$. In such a case,
it implies that the judges agree less among themselves than at random,
i.e., there is more disagreement than agreement. Our present interest is to
measure agreement; consequently, we do not entertain any $S_w^2 > \sigma_H^2$.
If it should occur, a coefficient of disagreement may be defined as

E 2 
gos 


where S? > Sy? and 


2. m EN 2 
Sa = т — 1 =) (и, = y) 


It is quite clear that the feasibility of computing the coefficient of
agreement depends solely upon our ability to obtain the appropriate
weights $y_k$ for each of the categories $C_k$. In the passages to follow
we shall show such a procedure.


The Determination of Appropriate Weights 
for the t Categories 


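Let n subjects be judged by m judges according to some criteria
into t categories. Since the range of variation of agreement is
defined in terms of intra-subject variance, our next task is to assign a
set of meaningful weights to each of the $C_k$, such that the variances
may be calculated. Let us regroup the m × n array as follows:

$$
\begin{array}{c|cccccc|c}
 & C_1 & C_2 & \cdots & C_k & \cdots & C_t & \Sigma \\ \hline
J_1 & n_{11} & n_{12} & \cdots & n_{1k} & \cdots & n_{1t} & n \\
J_2 & n_{21} & n_{22} & \cdots & n_{2k} & \cdots & n_{2t} & n \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
J_j & n_{j1} & n_{j2} & \cdots & n_{jk} & \cdots & n_{jt} & n \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
J_m & n_{m1} & n_{m2} & \cdots & n_{mk} & \cdots & n_{mt} & n \\ \hline
\Sigma & n_{.1} & n_{.2} & \cdots & n_{.k} & \cdots & n_{.t} & mn
\end{array}
$$

where $n_{jk}$ is the count of individuals placed in the kth category by
the jth judge.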

Let f(x) be the frequency function of random variable X defined
in terms of $C_k$ and assume the first two moments of X exist.
We see that each $C_k$ constitutes a segment of the area under the
curve as follows:

$$\Pr\{X \in C_k\} = \int_{C_k} f(x)\,dx.$$

Now let random variable y be defined as the distribution function
of X, say F(x), such that

$$y = F(x) = \int_{-\infty}^{x} f(u)\,du,$$

then

$$\frac{dy}{dx} = f(x), \qquad \frac{dx}{dy} = \frac{1}{f(x)}.$$

The frequency function of y is therefore

$$g(y) = f(x)\left|\frac{dx}{dy}\right| = \frac{f(x)}{f(x)} = 1, \qquad 0 \le y \le 1.$$

We see the transformation x → y has the unique property that
regardless of what the form of f(x) might have been, the transformed
variable y always follows the uniform distribution
g(y) with $\mu = 1/2$ and $\sigma^2 = 1/12$.
In practice, we let $P_k = n_{.k}/mn$, and

$$y_k = F(C_k) = \sum_{i=1}^{k-1} P_i + \tfrac{1}{2}P_k.$$

It can be shown that

$$E(y) = \tfrac{1}{2} \quad\text{and}\quad E(S_y^2) = \tfrac{1}{12}.$$
We have thus obtained a set of values for the t categories. The
transformed variable y is on a probabilistic scale in terms of the
distribution function of X (Bross 1958). Substituting the computed
$y_k$'s for the corresponding $C_k$'s for each $X_{ij}$, we obtain an array
of m × n entries

$$
\begin{array}{c|ccccc}
 & J_1 & J_2 & \cdots & J_j & \cdots\ J_m \\ \hline
S_1 & X_{11} & X_{12} & \cdots & X_{1j} & \cdots\ X_{1m} \\
S_2 & X_{21} & X_{22} & \cdots & X_{2j} & \cdots\ X_{2m} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
S_n & X_{n1} & X_{n2} & \cdots & X_{nj} & \cdots\ X_{nm}
\end{array}
$$

(i = 1, 2, ..., n; j = 1, 2, ..., m), where each $X_{ij}$ now takes one of
the values $y_1, y_2, \ldots, y_t$.

Thus the above array may be considered as a two-way factorial
design (subjects × judges) with n and m levels respectively.

An analysis of variance may be performed as follows:

                      df           m.s.       m.s. is estimate of
Between subjects      n - 1        S_b^2      sigma^2 + m sigma_s^2
Within subjects       n(m - 1)     S_w^2      sigma^2
Total                 nm - 1

The within subject variance under maximum entropy condition is
by definition

$$\sigma_H^2 = \frac{1}{t}\sum_{k=1}^{t} y_k^2 - \left(\frac{1}{t}\sum_{k=1}^{t} y_k\right)^2.$$


A test of significance of A may be conducted indirectly. We
shall choose to test the hypothesis that the assignments of subjects
to the categories by the judges are at random, i.e., equally likely for
all t categories. Then we would expect that

$$E(S_w^2) = \sigma_H^2.$$

Thus the statistic

$$\theta = \frac{S_w^2}{\sigma_H^2}$$

is $\chi^2/df$ distributed with n(m − 1) degrees of freedom. If we reject
the hypothesis that $E(S_w^2) = \sigma_H^2$ we can conclude that A is
significantly different from zero (Dixon and Massey 1957).
Substituting the necessary quantities $S_w^2$ and $\sigma_H^2$ into equation (1),
we have the coefficient of agreement

$$A = \frac{\sigma_H^2 - S_w^2}{\sigma_H^2}.$$
A remark concerning the appropriateness of the above analysis
of variance operation would seem in order here. One might have
noticed that the transformed variable y is uniformly distributed,
whereas in analysis of variance, normality of the variable is
required. The violation of normality is not unique to the present
case; in fact, the use of non-normal data as if they were normal is
quite common in statistical work in education and psychology.
For instance, the computation of the coefficient of reliability is
based on test scores of a dichotomous nature. The valid argument for
tolerating this violation is not that with company one has strength
for wrongdoing; instead, it is the robustness of the analysis
of variance technique: the effect of non-normality is negligible as
long as the subclass numbers are equal (Tukey, 1956, 57; Scheffe,
1964). The two-way factorial table in our present case certainly
has this property.


Illustrative Examples 


Suppose there are 12 subjects who each has performed a task. 
As shown in Table 1 each of the 12 subjects was rated by two sets 
of four judges each. Each subject was assigned a grade, say A, B, 
C, D, and F. 


TABLE 1 
a کک‎ 
І п 
EE ج‎ 
Ri R mE а mR а Е 
1 ү ИЕ Y- * ^ A E 
2 F E. Duns "noa 
3 P DUE Doo NE 
4 Dp. DBD o: BIDAR 
5 p D BS gom TOP» 
6 D ADFT HOS ССОО 
7 о G ОБИ 6.404 B. XP 
8 c ug. ВСС B tB O 
9 с. 2B aie 5 (EITA O 
10 B | B Bes ЕРТС) А 
n reu ss AC IK MAS 8 
12 A uA TNR aol pea ТТА 
дса раро оѓ Sa Tas La The кесүүсү chow © Mer 
for Set I than for Set П. 
The computation of weights and computation of $S^2$

SET I
                 (1)      (2)        (3)                 (4)          (5)
Category k       n_k      n_k/2      sum of n_i, i < k   (2) + (3)    (4)/n
1 (F)             9       4.5         0                   4.5         .09375
2 (D)            14       7.0         9                  16.0         .33333
3 (C)            10       5.0        23                  28.0         .58333
4 (B)             9       4.5        33                  37.5         .78125
5 (A)             6       3.0        42                  45.0         .93750

SET II
1 (F)             9       4.5         0                   4.5         .09375
2 (D)            10       5.0         9                  14.0         .29167
3 (C)            10       5.0        19                  24.0         .50000
4 (B)             9       4.5        29                  33.5         .69792
5 (A)            10       5.0        38                  43.0         .89583
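
The weight computation in the table above can be reproduced with a few lines of Python. This is a minimal sketch: the category counts are read off the table (ordered F, D, C, B, A), and the function name is an illustrative choice.

```python
def category_weights(counts):
    """Weights y_k = (sum of n_i for i < k + n_k / 2) / n for ordered categories."""
    total = sum(counts)
    weights = []
    below = 0  # number of ratings in all lower categories
    for n_k in counts:
        weights.append((below + n_k / 2) / total)
        below += n_k
    return weights

# Counts per category in the order F, D, C, B, A (n = 48 ratings per set).
set_one = category_weights([9, 14, 10, 9, 6])
set_two = category_weights([9, 10, 10, 9, 10])

print([round(y, 5) for y in set_one])
print([round(y, 5) for y in set_two])
```

Because each weight is the cumulative proportion up to the midpoint of its category, the count-weighted mean of the weights is always one half, in line with the uniform distribution on which the transformation rests.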


Substituting the calculated y's into the original table, we compute
the analysis of variance for each set.

              df      m.s. (I)     m.s. (II)
Subject       11      .32753       .29089
Within        36      .00543       .01774

$$\sigma_H^2(\mathrm{I}) = \tfrac{1}{5}\sum y_k^2 - \left(\tfrac{1}{5}\sum y_k\right)^2 = .38987 - .29793 = .09194$$

$$\sigma_H^2(\mathrm{II}) = \tfrac{1}{5}\sum y_k^2 - \left(\tfrac{1}{5}\sum y_k\right)^2 = .32609 - .24573 = .08036$$


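Tests of significance

$$\theta_{\mathrm{I}} = \frac{.00543}{.09194} = .05906^{**}, \qquad df = 36$$

$$\theta_{\mathrm{II}} = \frac{.01774}{.08036} = .22075^{**}, \qquad df = 36$$

The coefficients of agreement are:

$$A_{\mathrm{I}} = \frac{.09194 - .00543}{.09194} = .94094$$

$$A_{\mathrm{II}} = \frac{.08036 - .01774}{.08036} = .77924$$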
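
As a check on the arithmetic, the following Python sketch recomputes the maximum-entropy variance for Set I from its rounded weights and then the two coefficients from the reported summary values. The function names are illustrative, and the within mean squares are simply taken from the analysis of variance above.

```python
def max_entropy_variance(weights):
    """sigma_H^2: variance of the weights when all t categories are equally likely."""
    t = len(weights)
    mean_of_squares = sum(y * y for y in weights) / t
    mean = sum(weights) / t
    return mean_of_squares - mean * mean

def agreement_coefficient(sigma_h2, s_w2):
    """A = (sigma_H^2 - S_w^2) / sigma_H^2."""
    return (sigma_h2 - s_w2) / sigma_h2

# Set I weights as printed in the weights table (rounded to five places).
weights_set_one = [0.09375, 0.33333, 0.58333, 0.78125, 0.93750]
sigma_h2_one = max_entropy_variance(weights_set_one)

# Coefficients from the reported sigma_H^2 and within mean squares.
a_one = agreement_coefficient(0.09194, 0.00543)
a_two = agreement_coefficient(0.08036, 0.01774)
# Ratio of within mean square to sigma_H^2, the test statistic for Set I.
theta_one = 0.00543 / 0.09194

print(round(sigma_h2_one, 5), round(a_one, 5), round(a_two, 5), round(theta_one, 5))
```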
THE INTERPRETATION OF REGRESSION
COEFFICIENTS IN A SCHOOL EFFECTS MODEL*


ROBERT L. LINN AND CHARLES E. WERTS
Educational Testing Service

LEDYARD R TUCKER
University of Illinois


LARGE scale multipurpose “shotgun” surveys in research concerned
with school effects have generally been used to obtain descriptive
information about schools and students and to locate
student background and school characteristics associated with student
growth during the school years. The intense and often irreconcilable
debate about the interpretation of Coleman’s (1966)
monumental “Equality of Educational Opportunity” reflects the
highly speculative nature of any inferences about the causes of
change or growth in students. In light of these hazards it is perhaps
premature to consider the further refinement of using survey data
to construct a model simulating the effects of schools on students,
e.g., allowing for such questions as: What would happen to reading
skills if we increased the number, variety, and quality of books
available to students by a specified amount? Even if a reasonable
simulation model cannot be constructed now, we believe that the
process of trying to construct it can be useful in several ways
(Blalock and Blalock, 1968), notably:

1. In clarifying the numerous debates about which statistical
procedure should be used. One cannot specify which statistics are
appropriate until the (postulated) nature of the phenomena under
study has been specified. Thus attempts to specify a logical model
will require spelling out the alternate hypotheses in ways which
will, for example, suggest whether changes in mean, variation,
and/or relative rank are relevant. Many educational researchers
have implicitly equated correlation with “influence” without
considering whether the particular influence will be reflected in a
correlation.

2. In clarifying which data need to be collected in the next sur- 
vey to help choose among the plausible alternative hypotheses for 
observed associations. It is a common experience for researchers who 
work out a series of alternate hypotheses to find that the data from 
previously collected “shotgun” surveys are not detailed enough in 
that particular area to provide any test between the alternatives. 

3. In specifying the values and goals of education. As has been 
repeatedly noted, much survey research is a post mortem on past 
models rather than a consideration of the potentialities for change. 
Thus, if only a few schools have a progressive program in a given 
area, when all schools are analyzed together, the small percentage 
of variance accounted for by the “school effect” may reflect mainly 
the mediocrity of the past rather than the potential for the future. 
A widely used method with only a slight impact may account for a
larger percentage of variance than a very potent but rarely used
technique. Consideration of goals may also help avoid the common
mistake of simultaneously analyzing on the same output scale
schools with very disparate goals. This error confounds exposure
with quality of curriculum offered, thus hindering the possible
discovery of methods for more efficiently accomplishing the particular
goals set by the school administrators. Furthermore, this process
should lead to greater specification and discussion of alternative
strategies for attaining the specified goals and the costs of each.

In this paper we have attempted to illustrate the concrete and 
detailed mode of thinking required in the above process, by con- 
sidering how one might interpret regression coefficients in relation 
to several alternate strategies for allocating resources. While the 
real world certainly does not really operate according to the model 
discussed, at least predictions can be made which further research 
can either confirm or disprove. Optimistically, the result of this
type of concrete consideration may be a more “reasonably specific
formulation of what we do not know in a region of inquiry, of why
we want to know it, and, in the favorable case, of how we might
proceed to find out” (Merton, 1968). Our commitment is to the
“strong inference” approach enunciated by Platt (1964).

A Regression Model


It is customary to identify three classes of variables in school 
effects studies: measures of student “input,” measures of student 
“output,” and measures of the “school characteristics.” For illustra- 
tive purposes we shall use an example with a single input variable 
X, an output variable Y, and a school variable W. A linear regression 
model for this case would consist of: 


$$\hat{Y}_{ij} = B_w W_j + B_x X_{ij} + A$$
$$e_{ij} = Y_{ij} - \hat{Y}_{ij}$$

where
$X_{ij}$ is the input measure for student i in school j,
$Y_{ij}$ is the output measure for student i in school j,
$W_j$ is the “characteristic” measure for school j, and A is a constant.

When the covariances of e with W (i.e., $C_{ew}$) and e with X (i.e., $C_{ex}$)
are assumed to be zero, least squares analysis can be used to find
$B_w$, $B_x$, and A:

$$B_w = \frac{C_{wy}S_x^2 - C_{xy}C_{wx}}{S_w^2 S_x^2 - C_{wx}^2},$$

$$B_x = \frac{C_{xy}S_w^2 - C_{wy}C_{wx}}{S_w^2 S_x^2 - C_{wx}^2},$$

$$A = \bar{Y} - B_w\bar{W} - B_x\bar{X},$$

where
$\bar{X}$, $\bar{W}$ and $\bar{Y}$ are the means of X, W and Y, respectively, $S_x^2$,
$S_w^2$, $S_y^2$ are the variances of X, W and Y, respectively, and $C_{xy}$,
$C_{wy}$, and $C_{wx}$ are the covariances between X and Y, W and Y,
and W and X, respectively.
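
The least squares expressions can be tried out on simulated data. The sketch below is illustrative only: the number of schools and students, the coefficients used to generate the scores, and the helper names are all assumptions made for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Covariance of two equally long lists (divisor N)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)

def fit_school_model(w, x, y):
    """Return (B_w, B_x, A) from the variances and covariances of W, X and Y."""
    s_w2, s_x2 = cov(w, w), cov(x, x)
    c_wx, c_wy, c_xy = cov(w, x), cov(w, y), cov(x, y)
    denom = s_w2 * s_x2 - c_wx ** 2
    b_w = (c_wy * s_x2 - c_xy * c_wx) / denom
    b_x = (c_xy * s_w2 - c_wy * c_wx) / denom
    a = mean(y) - b_w * mean(w) - b_x * mean(x)
    return b_w, b_x, a

# Simulated data: 20 schools with 30 students each (hypothetical values).
rng = random.Random(0)
w, x, y = [], [], []
for school in range(20):
    w_j = rng.gauss(50, 10)                        # school characteristic W_j
    for student in range(30):
        x_ij = rng.gauss(100 + 0.3 * (w_j - 50), 15)   # input correlated with W
        w.append(w_j)
        x.append(x_ij)
        y.append(0.5 * w_j + 0.8 * x_ij + 10 + rng.gauss(0, 5))

b_w, b_x, a = fit_school_model(w, x, y)
residuals = [yi - (b_w * wi + b_x * xi + a) for wi, xi, yi in zip(w, x, y)]
print(b_w, b_x, a)
```

By construction the fitted residuals have zero mean and zero covariance with both W and X, which is the sample counterpart of the assumption that $C_{ew}$ and $C_{ex}$ are zero.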


The variance of Y is

$$S_y^2 = S_{\hat{y}}^2 + S_e^2$$

where

$$S_{\hat{y}}^2 = B_w^2 S_w^2 + B_x^2 S_x^2 + 2B_wB_xC_{wx},$$

and $S_e^2$ is the variance of e.

If a regression model is used for simulation purposes it is important
to remember the kinds of theoretical assumptions being
made:

1. Most importantly, that a linear additive model with equal unit
scale assumptions is at least plausible in light of what is known about
the effect to be measured. Does reality operate on the least squares
principle?

2. That all input and school variables influencing the output are 
in the model equation(s). Within a linear model, the absence of 
relevant influences will, in general, lead to “specification error” 
i.e., bias in calculated parameters due to incorrect specification of 
the structural model. If, however, an influence on the output is
uncorrelated with all independent variables specified in the equations,
its absence will, in general, not bias estimates of the weights
for the variables included. The error term in a regression model
represents all those unmeasured implicit factors which influence
output, but are assumed or known to be uncorrelated with the
independent variables in the equation(s). For a more complete
discussion of specification error see Theil (1957).

3. All independent variables are measured without error or appropriate
corrections are introduced for such errors, e.g., corrections
for unreliability. Random measurement error in the output
variable should not bias regression estimates. Cochran (1968) reviews
the effects of and corrections for errors of measurement.
Linear regression models are considered in sections eight through eleven
of Cochran's review.

4. If the correlations among independent variables become very
high and/or multiple forms of the same independent variable are
entered into the analysis in different guises, not only will the
statistical problems of multicollinearity (Farrar and Glauber, 1967)
be involved, but reasonable substantive interpretation of such
results may become almost impossible (Gordon, 1968).

5. That unstandardized regression weights do not, in general,
lead to estimates of the relative “importance” of effects but may,
nonetheless, be more suitable for making inferences about
“influences” in a linear model. Blalock (1967) and Tukey (1954) have
argued that the unstandardized regression weights are to be
preferred to standardized weights or correlations in attempting to
investigate influences primarily because of their relative independence
of sample characteristics.

The requisite theoretical and statistical assumptions will usually
be such as to render any interpretation of calculated regression
weights quite tentative. Unless the study has been designed to
elaborate the relationship of this model to reality, i.e., to eliminate
alternate hypotheses about the nature of reality, the calculated
regression weights may have no useful interpretation.


Interpretation of Change 

Suppose that the school characteristic could be manipulated
while keeping the distribution of the student input measure constant.
Let there be two situations, $\alpha$ and $\beta$, corresponding to two
distributions of W, i.e., two sets of values of the school characteristic
measure for each school, $W_{j\alpha}$ and $W_{j\beta}$.

Define Z as the difference between the two values of the school
atmosphere:

$$Z_j = W_{j\beta} - W_{j\alpha} \tag{1}$$

or

$$W_{j\beta} = W_{j\alpha} + Z_j.$$

Suppose, further, that A, $B_w$, $B_x$, and $S_e^2$ are constant for the two
situations. Let $S_{w\alpha}^2$ and $S_{w\beta}^2$ be the variances of W for the two
situations; $C_{zw\alpha}$ and $C_{zw\beta}$ be the covariances between Z and W
for the two situations; $C_{xw\alpha}$ and $C_{xw\beta}$ be the covariances between
X and W for the two situations; and $C_{yw\alpha}$ and $C_{yw\beta}$ be the covariances
between Y and W for the two situations.

The preceding assumes that there will be different distributions
of Y for the two situations, $Y_\alpha$ and $Y_\beta$, such that

$$\hat{Y}_{ij\alpha} = B_w W_{j\alpha} + B_x X_{ij} + A, \tag{2}$$

and

$$\hat{Y}_{ij\beta} = B_w W_{j\beta} + B_x X_{ij} + A. \tag{3}$$

It follows directly from (1), (2), and (3) that

$$\hat{Y}_{ij\beta} - \hat{Y}_{ij\alpha} = B_w Z_j.$$

Thus, $B_w$ is a measure of the effect of Z on the change in predicted
output for any given individual student in school j. Further, if it
is assumed that the mean error is zero in both situations, i.e.,
$\bar{e}_\alpha = \bar{e}_\beta = 0$, then $B_w$ can be interpreted as a measure of the effect of
Z on the change in means on Y, since

$$\bar{Y}_\beta - \bar{Y}_\alpha = B_w\bar{Z}, \tag{4}$$

which when $\bar{W}_\beta \neq \bar{W}_\alpha$ yields

$$B_w = \frac{\bar{Y}_\beta - \bar{Y}_\alpha}{\bar{W}_\beta - \bar{W}_\alpha}.$$

In general then, the regression weight for W can, in this model, be
interpreted as the unit change in output (Y) that can be expected
from a unit change in the school atmosphere (W), if input and
implicit factors (e) remain constant. The regression weight for
X indicates the unit change in output (Y) that can be expected
from a unit change in the input (X), if the school characteristic
and the implicit factors (e) remain constant. If both input and the
school characteristic change one unit (e constant) then Y will
change $B_w + B_x$ units. If input influences the school environment
and all other influences on school environment are independent of
input, i.e., $W = B_{wx}X + e$, then a change of one unit in X will
have two effects: (a) to change the environment $B_{wx}$ units, which
in turn result in ($B_{wx}B_w$) units of change in output, and (b) X
will, in addition, directly change output $B_x$ units. When the school
atmosphere is a group composition effect which is necessarily
influenced by input, it is unreasonable to think of changing the
school variable and maintaining input constant. It follows that no
interpretation of regression coefficients can proceed apart from a
consideration of the nature of the effect being studied and the
network or context in which the effect occurs.

Further understanding of regression coefficients can be obtained
by examining the relationship of $B_w$ to changes in output variance.
It can be shown that under the assumptions specified above the
change in the variance of Y can be expressed as:

$$S_{y\beta}^2 - S_{y\alpha}^2 = 2C_{xz}B_wB_x + (S_{w\beta}^2 - S_{w\alpha}^2)B_w^2 \tag{5}$$


Consider three possible strategies for allocating resources (assuming
high X is “better” and higher W is more “desirable” in terms
of output) to improve the school environment W: (1) “random
allocation” of resources so that the level of student input is
uncorrelated with the change in school environment ($C_{xz} = 0$); (2)
the “rich get richer” approach in which the better the students
entering a school, the more resources are provided to improve the
school characteristic (i.e., $r_{xz} = +1.0$); and (3) the “compensatory
education model” in which the worse the students entering a school,
the more resources are provided (i.e., $r_{xz} = -1.0$). For case one,
the “random allocation” strategy:

$$\frac{S_{y\beta}^2 - S_{y\alpha}^2}{S_{w\beta}^2 - S_{w\alpha}^2} = B_w^2, \qquad (6)$$

provided that $S_{w\beta}^2 \neq S_{w\alpha}^2$.

Thus, in this case the square of the regression coefficient for the
measure of the school characteristic is equal to the ratio of the
difference between the variances of the output for the two
conditions to the difference between the variances in the school
characteristic measures for the two conditions. For case two, the “rich
get richer” strategy, the difference between the variances of the
output for the two conditions becomes

$$S_{y\beta}^2 - S_{y\alpha}^2 = 2S_xS_zB_xB_w + (S_{w\beta}^2 - S_{w\alpha}^2)B_w^2, \qquad (7)$$

since $C_{xz} = S_xS_z$. Similarly, for case three, the compensatory
education model:

$$S_{y\beta}^2 - S_{y\alpha}^2 = -2S_xS_zB_xB_w + (S_{w\beta}^2 - S_{w\alpha}^2)B_w^2, \qquad (8)$$

since $C_{xz} = -S_xS_z$.
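The variance relation underlying equations (5) through (8) can be checked numerically. The sketch below assumes the model form Y = B_w W + B_x X + e with Z equal to the change in W between the two conditions and e unchanged and uncorrelated with Z; the data values and function names are hypothetical and serve only to show that the direct computation and the formula agree.

```python
from statistics import fmean, pvariance


def pcov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])


def variance_change(b_w, b_x, c_xz, var_w_alpha, var_w_beta):
    """Change in output variance: 2*C_xz*B_x*B_w + (S_wb^2 - S_wa^2)*B_w^2."""
    return 2 * c_xz * b_x * b_w + (var_w_beta - var_w_alpha) * b_w ** 2


# Hypothetical data; e is chosen so that its covariance with z is zero.
x = [1.0, 2.0, 4.0, 7.0, 6.0]
w_alpha = [3.0, 1.0, 2.0, 5.0, 4.0]
z = [0.5, -1.0, 2.0, 1.5, -0.5]
e = [1.0, 1.0, 1.0, 0.0, 0.0]
b_w, b_x = 0.5, 0.8

w_beta = [w + d for w, d in zip(w_alpha, z)]
y_alpha = [b_w * w + b_x * xi + ei for w, xi, ei in zip(w_alpha, x, e)]
y_beta = [b_w * w + b_x * xi + ei for w, xi, ei in zip(w_beta, x, e)]

direct = pvariance(y_beta) - pvariance(y_alpha)
formula = variance_change(b_w, b_x, pcov(x, z),
                          pvariance(w_alpha), pvariance(w_beta))
print(direct, formula)
```

Setting the covariance of X and Z to zero, to S_x S_z, or to -S_x S_z in `variance_change` gives the three allocation strategies discussed in the text.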
A simple illustration may help clarify some of the implications of
equations (4) through (8). Suppose X and Y were scores on a physics
test at the beginning and at the end of a second semester of physics,
and that $W_j$ was the expenditure per student for the second semester
of physics at school j. Now suppose that the following regression
equation was obtained:

$$\hat{Y}_{ij\alpha} = .5W_{j\alpha} + .8X_{ij} + C.$$

Given the assumptions of the model, a change in expenditure per
student from $W_{j\alpha}$ to $W_{j\beta}$ would result in a predicted score for a
student at school j of

$$\hat{Y}_{ij\beta} = \hat{Y}_{ij\alpha} + .5(W_{j\beta} - W_{j\alpha}) = \hat{Y}_{ij\alpha} + .5(Z_j).$$

If a change in state allocations resulted in $Z_j = K$ for all j, then:

$$\bar{Y}_\beta - \bar{Y}_\alpha = .5K,$$

and $S_{y\beta}^2 - S_{y\alpha}^2 = 0$. On the other hand, if there was no mean
change in allocations and, therefore, expenditures (i.e., $\bar{Z} = 0$)
then

$$\bar{Y}_\beta - \bar{Y}_\alpha = 0.$$
Adjustments in the allocations to individual schools, while
maintaining $\bar{Z} = 0$, would have quite different implications for the
variance of the output scores depending on the correlation between
Z and X. If adjustments in W were made that were completely
unrelated to X as in the “random allocation” strategy, then the change
in variance of Y would depend entirely on the change in the
variance of W, i.e.,

$$S_{y\beta}^2 - S_{y\alpha}^2 = .25(S_{w\beta}^2 - S_{w\alpha}^2),$$

which, in turn, could be expressed in terms of the correlation
between Z and $W_\alpha$, $r_{zw\alpha}$:

$$S_{y\beta}^2 - S_{y\alpha}^2 = .25S_z(2r_{zw\alpha}S_{w\alpha} + S_z).$$

Thus, for case one when $B_w$ is positive $S_{y\beta}^2$ will be larger than
$S_{y\alpha}^2$ if $r_{zw\alpha}$ is positive. If $r_{zw\alpha}$ is negative then $S_{y\beta}^2$ may be larger
or smaller than $S_{y\alpha}^2$ depending on the relative magnitudes of $S_{w\alpha}$
and $S_z$.

If adjustments for the illustration were made such that allocations
were increased at schools with the initially best students and
decreased at schools with the initially poorest students as in the “rich
get richer” strategy (i.e., $r_{xz} > 0$) then:

$$S_{y\beta}^2 - S_{y\alpha}^2 = .8C_{xz} + .25(S_{w\beta}^2 - S_{w\alpha}^2)$$

$$= S_z(.8r_{xz}S_x + .5r_{zw\alpha}S_{w\alpha} + .25S_z).$$

At the extreme where $r_{xz} = 1$ (i.e., case two)

$$S_{y\beta}^2 - S_{y\alpha}^2 = S_z(.8S_x + .5r_{zw\alpha}S_{w\alpha} + .25S_z),$$

which will, in general, be positive except when $r_{zw\alpha}$ is negative and
$S_{w\alpha}$ is large relative to $S_x$ and $S_z$. In the “compensatory education
model” the allocations are increased at the schools with the
initially poorest students and decreased at the schools with the
initially best students. The difference between the output variances
for the illustration for the most extreme compensatory education
model (i.e., $r_{xz} = -1$) would be

$$S_{y\beta}^2 - S_{y\alpha}^2 = -.8S_xS_z + .25(S_{w\beta}^2 - S_{w\alpha}^2)$$

$$= S_z(-.8S_x + .5r_{zw\alpha}S_{w\alpha} + .25S_z).$$

The variance of Y for condition $\beta$ can clearly be less than at
condition $\alpha$ even when $r_{zw\alpha}$ is positive. Generally $r_{zw\alpha}$ would be
expected to be negative since $C_{zw\alpha} = C_{w\beta w\alpha} - S_{w\alpha}^2$.
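The three strategies in the physics illustration can be compared with a short calculation. This is a sketch under stated assumptions: the weights .5 (for W) and .8 (for X) are taken from the illustration, while the standard deviations and the correlation between Z and the initial W are hypothetical values chosen only to show the direction of the change.

```python
def illustration_variance_change(r_xz, r_zw, s_x, s_z, s_w_alpha,
                                 b_w=0.5, b_x=0.8):
    """Change in output variance for the illustration.

    With b_w = .5 and b_x = .8 this equals
    S_z * (.8 * r_xz * S_x + .5 * r_zw * S_w_alpha + .25 * S_z).
    """
    return s_z * (2 * b_x * b_w * r_xz * s_x
                  + 2 * b_w ** 2 * r_zw * s_w_alpha
                  + b_w ** 2 * s_z)


# Hypothetical spreads: S_x = 10, S_z = 2, S_w_alpha = 4, r_zw = -.2.
common = dict(r_zw=-0.2, s_x=10.0, s_z=2.0, s_w_alpha=4.0)
random_allocation = illustration_variance_change(r_xz=0.0, **common)
rich_get_richer = illustration_variance_change(r_xz=1.0, **common)
compensatory = illustration_variance_change(r_xz=-1.0, **common)
print(random_allocation, rich_get_richer, compensatory)
```

With these particular values the output variance rises sharply under the "rich get richer" strategy and falls under the compensatory strategy, while random allocation changes it only slightly.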
Obviously, many other illustrations could have been given. For
example, $B_w$ and/or $B_x$ might have negative signs. The
conclusions about the variances would not be changed if there was
a mean shift in allocations (i.e., $\bar{Z} \neq 0$). The implications of the
model for a given situation can be readily evaluated, however,
which makes possible explicit predictions about the effects of any
set of $Z_j$ on the mean and variance of Y.

In most practical problems more than one measure of different
characteristics of the school is apt to be required. If so, the basic
model becomes more complicated and terms must be included in
the regression equation for each measure. The effect of manipulating
any one or combination of these aspects on the mean and variance
of the outputs, however, could be estimated in a similar fashion.
ANALYZING SCHOOL EFFECTS:
ANCOVA WITH A FALLIBLE COVARIATE*


CHARLES E. WERTS AND ROBERT L. LINN
Educational Testing Service 


THE analysis of variance, covariance method (ANCOVA) has
been employed in nonexperimental school effects studies to control
for differential input when studying the differential impact of
schools as a categorical treatment factor on some output variable.
One of the numerous hazards (Smith, 1957) to interpreting these
ANCOVA findings results from the use of input measures known
to have considerable errors of measurement. As a consequence,
input may not be completely controlled, i.e., the “adjusted” treatment
variance which is labeled the “differential” school effect may,
to an unknown degree, still reflect differential input. While no
general cure for this problem is currently available, it is worth
considering what use might be made of reliability estimates and in what
circumstance corrections for unreliability would not decrease the
estimated differential school effect.


Rationale for the use of ANCOVA 


It is, unfortunately, the case that many school effects researchers
have used ANCOVA without any substantive justification other
than the statement that input should be controlled. It is the case,
however, that underlying ANCOVA is a mathematical model,
which casts the data in a framework that will be meaningful only
to the degree that the phenomena being studied actually behave
like this model. To illustrate this problem consider a case in which

* The research reported herein was performed pursuant to Grant No.
OEG-1-6-061830-0650 Project No. 6-1830 with the Office of Education, U. S.
Department of Health, Education, and Welfare.
there is one input variable X and an output variable Y. The model
for ANCOVA is

$$Y_{ij} = A_j + B_wX_{ij} + e_{ij}, \qquad (1)$$

where $A_j$ = the Y intercept of the regression line for school j,
$B_w$ = pooled least squares estimate of the within school
regression slope,
and $e_{ij}$ = random fluctuations due to unmeasured factors.


[Figure 1. The Model for ANCOVA. Within-school regression lines of Y
(vertical axis) on X (horizontal axis) for School #1, School #2, and
School #3.]


As illustrated in Figure 1, equation (1) when applied to school
effects research asserts that the slopes of the within school
regression lines are all equal. Theoretically, this means that a student
entering school #1 will on the average gain $A_1 - A_2$ more units of
output than a student entering school #2, regardless of whether his
input score is initially high or low. Thus the ordering of the
intercepts indicates the ordering (in output units) of the schools in
terms of effectiveness, regardless of whether the most effective
schools get the students with the highest input scores or not (i.e.,
whether treatment effects are correlated with the covariate). For
any given school the additive constant is the same for everyone, in
that sense implying that the school is “equally effective” for
students of high as for students of low input scores. If a given school
did not actually get students at a particular level of input, it is
assumed that if it did, it would do as well with them as with those it
actually received, i.e., that the regression lines may be validly
extended beyond their observed ranges.

In actual research the homogeneity of the within group regression
slopes can be examined to see if ANCOVA can be applied with
this assumption at all. However, the really crucial assumptions will,
in general, be untestable, viz., whether the true effect is
appropriately simulated by the ANCOVA model, especially whether all
other influences operating before, at, or after input are in fact
independent of both the treatment effect and the covariates. As a
consequence, the interpretation of the ANCOVA results is speculative
and should be so labeled. The findings will be useful only to the
degree that they serve to engender scientific progress, not as a
statement about the existence or nonexistence of school effects. One
would be foolish to assert from ANCOVA results that a middle
class suburban school would do as well with slum children. Despite
these hazards, researchers in the physical sciences have found
oversimplified models of reality useful, if only to have something
specific to disprove.

For the purpose of considering the implications of random errors
of measurement in the covariate we will assume that the ANCOVA
model, equation (1), is appropriate, especially the theoretical
assumption that a common within group regression slope can be used
to adjust for mean differences in the covariate (Smith, 1957). As
a consequence the $A_j$ intercepts in equation (1) can be estimated
from equation (2):

$$A_j = \bar{Y}_j - B_w\bar{X}_j, \qquad (2)$$

where $\bar{Y}_j$ = sample mean of $Y_{ij}$ for group j
and $\bar{X}_j$ = sample mean of $X_{ij}$ for group j.

For heuristic purposes it is useful to substitute equation (2) into
equation (1) yielding:

$$Y_{ij} = \bar{Y}_j - B_w\bar{X}_j + B_wX_{ij} + e_{ij}. \qquad (3)$$

If, in fact, one first computes $\bar{Y}_j$ and $\bar{X}_j$ for each group and
assigns these values to individuals in the regression equation

$$Y_{ij} = B_1\bar{Y}_j + B_2\bar{X}_j + B_3X_{ij} + e_{ij},$$

it will be found that the regression coefficients are $B_1 = 1$, $B_2 = -B_3$,
and $B_3 = B_w$ (the pooled within group regression slope). This
procedure is a simple transformation of the dummy variable approach
to ANCOVA discussed by Cohen (1968) and one could perform
similar statistical manipulations. Equation (3) is of heuristic interest
because it shows that it is the within groups regression slope, not
the between groups or total slope, which is the theoretically
interesting quantity from the viewpoint of this particular model.
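The estimation of the intercepts from the group means and the pooled within group slope can be sketched in a few lines. The example below is a minimal illustration, not the authors' procedure: the function name and the three-school data set are hypothetical, and the data are constructed to lie exactly on parallel lines so that the recovered slope and intercepts are easy to verify by hand.

```python
from statistics import fmean


def ancova_intercepts(groups):
    """Return (pooled within-group slope, {group: intercept}).

    groups maps a school label to a list of (x, y) pairs. The slope pools
    within-school cross-products and sums of squares; each intercept is the
    school's mean of Y minus the slope times its mean of X.
    """
    sum_xy = 0.0
    sum_xx = 0.0
    means = {}
    for school, pairs in groups.items():
        mean_x = fmean([x for x, _ in pairs])
        mean_y = fmean([y for _, y in pairs])
        means[school] = (mean_x, mean_y)
        sum_xy += sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        sum_xx += sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sum_xy / sum_xx
    intercepts = {s: my - slope * mx for s, (mx, my) in means.items()}
    return slope, intercepts


# Hypothetical schools on parallel lines y = A_j + 2x with A = 5, 3, 1.
schools = {
    "school_1": [(1.0, 7.0), (2.0, 9.0), (3.0, 11.0)],
    "school_2": [(4.0, 11.0), (5.0, 13.0), (6.0, 15.0)],
    "school_3": [(7.0, 15.0), (8.0, 17.0), (9.0, 19.0)],
}
slope, intercepts = ancova_intercepts(schools)
print(slope, intercepts)
```

In this constructed example the school receiving the lowest-scoring students has the largest intercept, so the between-school slope differs from the within-school slope while the ordering of the intercepts is unaffected.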

If the most effective schools get the best students (i.e., $r_{AX} > 0$
in equation 1) the between and the total slopes will differ from the 
within slope (violating the ANCOVA assumption that between 
and within slopes be homogeneous) but our theory as stated in 
equation (3) indicates that this will not affect estimates of the inter- 
cepts nor, therefore, the ordering of schools for effectiveness. Be- 
cause of the homogeneity of within group regression, the ordering 
and variance of the intercepts will be the same as the ordering and 
variance of the adjusted means. Given that the adjusted means sig- 
nificantly differ from each other, the alternate hypothesis that these 
differences may be due to unreliability needs to be explored. Be- 
cause reliability coefficients are usually unavailable, our focus will 
be on stating the conditions under which it is reasonable to believe 
that corrections for unreliability should (at least in theory) increase 
the spread (i.e., the variance) of the adjusted means. 

It should be emphasized that equation (2) indicates that the in- 
tercepts are a residual quantity representing that part of the output 
means not accounted for by the input means. Therefore, the finding
of a residual school effect (adjusted between school) variance
in no sense is positive evidence that the school does have an effect,
only that we have not proven that it doesn’t. Even if input were
adequately controlled, the numerous events happening outside of
school during the study might well explain much of the residual
variance.


Unreliability 


According to the classical theory of unreliability, because of
random errors of measurement, a person whose score is deviant on
one test will, on the average, have a score closer to the mean on a
parallel form of that test (Lord and Novick, 1968, pp. 64-66). One
problem in correcting the covariate for unreliability is that a priori
we do not know whether on a parallel form the scores for an
individual will tend to regress towards the mean of his school (i.e.,
school means are infallible) or towards the overall covariate mean
(fallible school means). The former alternative would imply that
the school mean itself should be the same on both parallel forms
whereas the latter would imply that the school mean itself would
“regress” towards the overall mean on the parallel form. In actual
school effects studies such alternatives can seldom be tested because
of the lack of parallel forms and lack of independence of the
measurement errors from the school effect, other covariates, and the
output variable. Furthermore, even if an input variable like social
class background could be perfectly measured, this variable is
typically used as a surrogate for variables like family values, which
means that lack of validity will be the more serious hazard to
interpretation.

It follows from the above discussion that two cases need to be
treated: (a) when the errors of measurement in the covariate are
distributed randomly with a zero mean for the total sample
irrespective of group, in which case the overall mean on the observed
covariate is assumed to be equal to the mean value of the true scores,
i.e., the observed group means are assumed to be fallible, and (b)
when the errors of measurement in the covariate are distributed
randomly within groups and have a zero mean within groups, in
which case the observed group means are assumed to equal the
mean value of the true scores for the persons in that group.


Case (a)

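1.1 The variance of the true covariate scores ($\sigma_x^2$) will equal:
$\sigma_x^2 = r_{xx}\sigma_X^2$, where $r_{xx}$ = reliability, and $\sigma_X^2$ = observed
variance of the covariate.

1.2 The variance of the true group means ($\sigma_{\bar{x}}^2$) when assigned to
individuals in equation (3):

$$\sigma_{\bar{x}}^2 = r_{xx}\sigma_{\bar{X}}^2.$$

1.3 In terms of true scores indicated by lower case letters,

$$Y_{ij} = B_1\bar{Y}_j + B_2\bar{x}_j + B_3x_{ij} + e_{ij}.$$

1.4 The normal equations are:

$$\sigma_{Y\bar{Y}} = B_1\sigma_{\bar{Y}}^2 + B_2\sigma_{\bar{Y}\bar{x}} + B_3\sigma_{\bar{Y}x},$$

$$\sigma_{Y\bar{x}} = B_1\sigma_{\bar{Y}\bar{x}} + B_2\sigma_{\bar{x}}^2 + B_3\sigma_{\bar{x}x},$$

and

$$\sigma_{Yx} = B_1\sigma_{\bar{Y}x} + B_2\sigma_{\bar{x}x} + B_3\sigma_x^2.$$

1.5 To obtain the normal equations in terms of observed scores the
following relationships are useful:

a. The covariances among the true scores are identical to the
covariances among the observed scores (DuBois, 1957).

b. When group means are assigned to individuals as in equation
(3) the covariance of $\bar{X}$ with X will equal the variance of
the group means, i.e., $\sigma_{\bar{X}X} = \sigma_{\bar{X}}^2$ and $\sigma_{Y\bar{Y}} = \sigma_{\bar{Y}}^2$.

c. Likewise the covariance of $\bar{X}$ with Y will equal the covariance
of $\bar{X}$ and $\bar{Y}$, i.e., $\sigma_{Y\bar{X}} = \sigma_{\bar{X}\bar{Y}}$.

1.6 By substitution the normal equations become

$$\sigma_{\bar{Y}}^2 = B_1\sigma_{\bar{Y}}^2 + B_2\sigma_{\bar{X}\bar{Y}} + B_3\sigma_{\bar{X}\bar{Y}},$$

$$\sigma_{\bar{X}\bar{Y}} = B_1\sigma_{\bar{X}\bar{Y}} + B_2r_{xx}\sigma_{\bar{X}}^2 + B_3r_{xx}\sigma_{\bar{X}}^2,$$

and

$$\sigma_{YX} = B_1\sigma_{\bar{X}\bar{Y}} + B_2r_{xx}\sigma_{\bar{X}}^2 + B_3r_{xx}\sigma_X^2.$$

1.7 Solution of these equations yields

$$B_1 = 1, \quad B_2 = -B_3, \quad B_3 = \frac{B_w}{r_{xx}},$$

where $B_w$ = pooled within group regression for observed scores.

1.8 Since the true intercepts equal $B_1\bar{Y}_j + B_2\bar{x}_j$, the variance of
the true intercepts ($\sigma_a^2$) will be:

$$\sigma_a^2 = B_1^2\sigma_{\bar{Y}}^2 + B_2^2\sigma_{\bar{x}}^2 + 2B_1B_2\sigma_{\bar{Y}\bar{x}}.$$

1.9 By substitution

$$\sigma_a^2 = \sigma_{\bar{Y}}^2 + \frac{B_w^2}{r_{xx}}\sigma_{\bar{X}}^2 - \frac{2B_w}{r_{xx}}\sigma_{\bar{X}\bar{Y}}.$$

1.10 This may be compared with the observed variance among
intercepts ($\sigma_A^2$),

$$\sigma_A^2 = \sigma_{\bar{Y}}^2 + B_w^2\sigma_{\bar{X}}^2 - 2B_w\sigma_{\bar{X}\bar{Y}}.$$

1.11 Comparison of $\sigma_a^2$ to $\sigma_A^2$ indicates the “true” variation of the
intercepts may be larger or smaller than the observed variation
of the intercepts.

1.12 Since $\sigma_{\bar{X}\bar{Y}} \div \sigma_{\bar{X}}^2 = B_{\bar{Y}\bar{X}}$, comparison of step 1.9 to 1.10 indicates
that for $\sigma_a^2$ to be greater than $\sigma_A^2$ either:

a. $B_w$ and $B_{\bar{Y}\bar{X}}$ must be of opposite sign, or
b. if $B_w$ and $B_{\bar{Y}\bar{X}}$ are of the same sign then the absolute value
of $B_w$ must be greater than twice the absolute value of $B_{\bar{Y}\bar{X}}$.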
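The comparison in steps 1.9 through 1.12 can be tried out numerically. The sketch below assumes the case of fallible group means, in which the corrected variance of the intercepts uses the within slope divided by the reliability and the variance of the covariate means multiplied by the reliability; the function name and all numerical values are hypothetical.

```python
def intercept_variances_fallible_means(var_ybar, var_xbar, cov_xbar_ybar,
                                       b_w, r_xx):
    """Return (corrected, observed) variance of the intercepts, case (a).

    observed  = var_ybar + b_w^2 * var_xbar - 2 * b_w * cov
    corrected = var_ybar + (b_w^2 / r_xx) * var_xbar - (2 * b_w / r_xx) * cov
    """
    observed = var_ybar + b_w ** 2 * var_xbar - 2 * b_w * cov_xbar_ybar
    corrected = (var_ybar + (b_w ** 2 / r_xx) * var_xbar
                 - (2 * b_w / r_xx) * cov_xbar_ybar)
    return corrected, observed


# Hypothetical values: the external (between) slope is .5 / 1.0 = .5.
kwargs = dict(var_ybar=4.0, var_xbar=1.0, cov_xbar_ybar=0.5, r_xx=0.8)
steep = intercept_variances_fallible_means(b_w=1.5, **kwargs)    # |B_w| > 2(.5)
shallow = intercept_variances_fallible_means(b_w=0.8, **kwargs)  # |B_w| < 2(.5)
opposite = intercept_variances_fallible_means(b_w=-0.3, **kwargs)
print(steep, shallow, opposite)
```

With these values the correction increases the spread when the within slope exceeds twice the external slope or has the opposite sign, and decreases it otherwise.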
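Case (b)

When the group mean is the expected mean value of the true scores
for the persons in that group:

2.1 The observed variance of the covariate means will equal the
variance of the “true” means:

$$\sigma_{\bar{x}}^2 = \sigma_{\bar{X}}^2.$$

2.2 When it is assumed that the within group reliability of the
covariate scores is the same for all schools, the within group
variance ($\sigma_{xw}^2$) of the true scores will equal the reliability ($r_{xx}$)
times the within group variance ($\sigma_{Xw}^2$):

$$\sigma_{xw}^2 = r_{xx}\sigma_{Xw}^2 = r_{xx}(\sigma_X^2 - \sigma_{\bar{X}}^2).$$

2.3 It follows that the total variance of the true scores ($\sigma_x^2$) is

$$\sigma_x^2 = \sigma_{\bar{x}}^2 + \sigma_{xw}^2 = \sigma_{\bar{X}}^2 + r_{xx}\sigma_{Xw}^2.$$

2.4 The regression equation equivalent to equation (3) is (lower
cases indicating true scores)

$$Y_{ij} = B_4\bar{Y}_j + B_5\bar{x}_j + B_6x_{ij} + e_{ij}.$$

2.5 The normal equations are:

$$\sigma_{Y\bar{Y}} = B_4\sigma_{\bar{Y}}^2 + B_5\sigma_{\bar{Y}\bar{x}} + B_6\sigma_{\bar{Y}x},$$

$$\sigma_{Y\bar{x}} = B_4\sigma_{\bar{Y}\bar{x}} + B_5\sigma_{\bar{x}}^2 + B_6\sigma_{\bar{x}x},$$

and

$$\sigma_{Yx} = B_4\sigma_{\bar{Y}x} + B_5\sigma_{\bar{x}x} + B_6\sigma_x^2.$$

2.6 By substitution as before

$$\sigma_{\bar{Y}}^2 = B_4\sigma_{\bar{Y}}^2 + B_5\sigma_{\bar{X}\bar{Y}} + B_6\sigma_{\bar{X}\bar{Y}},$$

$$\sigma_{\bar{X}\bar{Y}} = B_4\sigma_{\bar{X}\bar{Y}} + B_5\sigma_{\bar{X}}^2 + B_6\sigma_{\bar{X}}^2,$$

and

$$\sigma_{YX} = B_4\sigma_{\bar{X}\bar{Y}} + B_5\sigma_{\bar{X}}^2 + B_6(\sigma_{\bar{X}}^2 + r_{xx}\sigma_{Xw}^2).$$

2.7 Solution of these equations yields

$$B_4 = 1, \quad B_5 = -B_6, \quad \text{and} \quad B_6 = \frac{B_w}{r_{xx}},$$

where $B_w$ = pooled within slope.

2.8 Since the true intercepts equal $B_4\bar{Y}_j + B_5\bar{x}_j$, the variance
of the true intercepts ($\sigma_a^2$) will be:

$$\sigma_a^2 = B_4^2\sigma_{\bar{Y}}^2 + B_5^2\sigma_{\bar{x}}^2 + 2B_4B_5\sigma_{\bar{Y}\bar{x}},$$

2.9 Which by substitution is

$$\sigma_a^2 = \sigma_{\bar{Y}}^2 + \frac{B_w^2}{r_{xx}^2}\sigma_{\bar{X}}^2 - \frac{2B_w}{r_{xx}}\sigma_{\bar{X}\bar{Y}}.$$

2.10 This, again, may be compared with the observed variance of
the intercepts

$$\sigma_A^2 = \sigma_{\bar{Y}}^2 + B_w^2\sigma_{\bar{X}}^2 - 2B_w\sigma_{\bar{X}\bar{Y}}.$$

2.11 A comparison of 2.9 and 2.10 indicates that $\sigma_a^2$ will be larger
than $\sigma_A^2$ when

$$\left(\frac{B_w^2}{r_{xx}^2}\sigma_{\bar{X}}^2 - \frac{2B_w}{r_{xx}}\sigma_{\bar{X}\bar{Y}}\right) > (B_w^2\sigma_{\bar{X}}^2 - 2B_w\sigma_{\bar{X}\bar{Y}}).$$

2.12 Since $B_{\bar{Y}\bar{X}} = \sigma_{\bar{X}\bar{Y}} \div \sigma_{\bar{X}}^2$ we find that $\sigma_a^2$ is greater than $\sigma_A^2$ when

$$\frac{B_w^2}{r_{xx}^2} - \frac{2B_wB_{\bar{Y}\bar{X}}}{r_{xx}} > B_w^2 - 2B_wB_{\bar{Y}\bar{X}}.$$

2.13 It can be shown that the inequality in 2.12 will hold if

$$B_w^2(1 + r_{xx}) > 2r_{xx}B_wB_{\bar{Y}\bar{X}},$$

which for $0 < r_{xx} < 1.0$ is true for the following conditions: (a)
$B_w$ and $B_{\bar{Y}\bar{X}}$ have opposite signs or (b) $B_w$ and $B_{\bar{Y}\bar{X}}$ have the
same signs and

$$|B_w| > \frac{2r_{xx}}{1 + r_{xx}}|B_{\bar{Y}\bar{X}}|.$$

Thus $\sigma_a^2$ is always greater than $\sigma_A^2$ when $B_w$ and $B_{\bar{Y}\bar{X}}$ have opposite
signs, and if $B_w$ and $B_{\bar{Y}\bar{X}}$ have the same sign then $\sigma_a^2$ is greater than
$\sigma_A^2$ when the absolute value of $B_w$ is greater than the absolute value
of $B_{\bar{Y}\bar{X}}$ multiplied by $2r_{xx}/(1 + r_{xx})$. Since $2r_{xx}/(1 + r_{xx})$ will
always be less than 1.0, it follows that where $B_w$ and $B_{\bar{Y}\bar{X}}$ have the
same sign then $\sigma_a^2 > \sigma_A^2$ when $|B_w| > |B_{\bar{Y}\bar{X}}|$.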
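The threshold for the case of infallible group means can be illustrated in the same way. This is a sketch under the stated assumption that the group means need no correction, so that only the within slope is divided by the reliability; the function names and numerical values are hypothetical.

```python
def intercept_variances_infallible_means(var_ybar, var_xbar, cov_xbar_ybar,
                                         b_w, r_xx):
    """Return (corrected, observed) variance of the intercepts, case (b).

    corrected = var_ybar + (b_w / r_xx)^2 * var_xbar - (2 * b_w / r_xx) * cov
    observed  = var_ybar + b_w^2 * var_xbar - 2 * b_w * cov
    """
    observed = var_ybar + b_w ** 2 * var_xbar - 2 * b_w * cov_xbar_ybar
    corrected = (var_ybar + (b_w / r_xx) ** 2 * var_xbar
                 - (2 * b_w / r_xx) * cov_xbar_ybar)
    return corrected, observed


def threshold_fraction(r_xx):
    """Fraction 2r/(1 + r) applied to the external slope in the same-sign case."""
    return 2 * r_xx / (1 + r_xx)


# Hypothetical values: external slope .5, reliability .8, threshold about .444.
kwargs = dict(var_ybar=4.0, var_xbar=1.0, cov_xbar_ybar=0.5, r_xx=0.8)
above = intercept_variances_infallible_means(b_w=0.5, **kwargs)
below = intercept_variances_infallible_means(b_w=0.4, **kwargs)
print(threshold_fraction(0.8) * 0.5, above, below)
```

A within slope of .5 lies above the threshold and the corrected spread exceeds the observed spread; a within slope of .4 lies below it and the corrected spread is slightly smaller.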

Discussion 

The above derivations can be summarized as follows:

If the internal slope ($B_w$) and the external slope ($B_{\bar{Y}\bar{X}}$)
are of opposite sign, then a correction for unreliability should
result in an increase in the spread of the adjusted means.

If the internal slope ($B_w$) and the external slope are of the
same sign, then the effect of correcting for unreliability depends on
whether the errors are distributed randomly (and with zero mean)
without respect to groups or are distributed randomly with zero
means within groups. In the former case the group means are
considered fallible and correcting for unreliability should increase the
spread of the adjusted means if the absolute value of the internal
slope is at least twice the absolute value of the external slope. In
the latter case, the group means are considered infallible and
corrections should increase the spread of the adjusted means if the
absolute value of the internal slope is at least as large as the absolute
value of the external slope multiplied by a fraction which depends
on the magnitude of the reliability. Since the former procedure is
more conservative than the latter, it should perhaps be preferred
in the absence of information about the distribution of the errors.

If reliability estimates are available then those values could be
substituted into the relevant equations and inferences made about the
effect on the spread of the adjusted means. The procedures outlined
by Porter (1967) and Thistlethwaite (1968) deal only with
the case where the errors of measurement in the covariate are
distributed randomly and have a zero mean within groups. Our
conclusions for this case are in only partial agreement with those of
Thistlethwaite who suggested that whenever the absolute value of
the internal slope exceeds that of the external slope, reliability
corrections will increase the spread of the adjusted means. Our
analyses show that the spread of the intercepts will also be increased
when the internal and external slopes have opposite signs or when
they have the same sign but the absolute value of the internal slope
exceeds the absolute value of the external slope times a fraction that ...

... on the relative magnitudes of the covariances among the covariates.
In the case of multiple covariates, it is possible to derive the relevant
normal equations as we have done and then to observe the effect of
varying reliability values on the spread of the adjusted means.


Cohen, J. Multiple regression as a general data-analytic system.
Psychological Bulletin, 1968, 70, 426-443.
A ONE-WAY ANALYSIS OF VARIANCE
FOR SINGLE-SUBJECT DESIGNS 


LESTER C. SHINE II AND SAMUEL M. BOWER 
The University of Dayton 


In Psychology, there has been a long standing conflict between
single-subject and multisubject researchers, which tends to center
about the question of whether or not statistics is really useful in
single-subject research. Such writers as Sidman (1961) and Skinner
(1953) emphasize precisely controlled, single-subject experiments
as the most fruitful experimental approach in Psychology, with
the role of statistics being limited to the use of elementary
descriptive statistics such as means and standard deviations. The powerful,
multivariate methods of modern statistics are considered to be
inapplicable because they are primarily designed to deal with groups
instead of individuals and because their averaging out processes
tend to obscure individual differences. Other writers, such as
Underwood (1957), argue that the best experimental approach in
Psychology is to study groups of subjects to which the modern
statistical inference methods may be applied. The purpose of the present
paper is to show that it is possible to view certain single-subject
designs in such a way as to make the technique of Analysis of
Variance applicable to them for the one-way case. Higher order
designs will be considered in a later paper.


The design the authors wish to propose is
identical in its layout with a standard one-way design with
repeated measures except that instead of using a group of subjects
on which is taken a single trial of repeated measures across the
levels of the Experimental factor, a single subject is used on which
are taken several trials of repeated measures across the levels of the
experimental factor. The two designs are presented schematically
in Table 1 for comparison purposes.

In the standard repeated measures design, a pseudo-random
factor (Subjects) is usually introduced to allow, with certain
assumptions, for the fact that repeated measures are taken on the subjects
(Winer, 1962). The purpose of the pseudo-factor is to absorb any
correlation between paired columns of measures on subjects,
introduced by the presence of effects due to taking repeated
measures. It is necessary to assume that the correlation between paired
observations under two different levels of the Experimental factor
is the same for all possible pairs of such levels, in order for the
usual F statistic to be strictly valid (Winer, 1962). Usually, no
interaction between the Subject and Experimental factors is permitted


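TABLE 1
Repeated Measures Design

                    B (Conditions)
  A          1         2        ...        q

  1        X_11      X_12       ...      X_1q
  2        X_21      X_22       ...      X_2q
  ...       ...       ...       ...       ...
  p        X_p1      X_p2       ...      X_pq

Note. Rows (A) are Subjects in the standard design and Trials in the
single-subject design.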

because such an effect is completely confounded with error effects
and cannot, therefore, be estimated separately.

The major assumption that is made for the proposed single subject
design is that the subject may be viewed as a response generator,
the responses of which to a particular stimulus are statistically
independent and normally distributed about a central response
value. At first glance this assumption appears to be a contradictory
statement since it appears that taking repeated measures on a single
subject could very well introduce a correlation between the data
under one level of the Experimental factor and the data under
another such level. It can be shown, however, that any such correlation
can be carried, in a manner similar to that of the standard
repeated measures design, by certain effects in the proposed design
without affecting the statistical independence of the observations.

Before a detailed examination of the assertions of the preceding
paragraph is undertaken, the working model and assumptions for
the proposed design will be presented. Under the assumptions that
the single subject's responses are statistically independent and
normally distributed, the factors of the design are, as shown in Table 1,
the Experimental factor (B) and the Trial factor (A). The
mathematical model, definitions, and assumptions are as follows:

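$$X_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij}$$

$X_{ij}$ = observation in cell i, j
$\mu$ = constant
$\alpha_i$ = constant for each i = 1, 2, ..., p; $\sum_i \alpha_i = 0$
$\beta_j$ = constant for each j = 1, 2, ..., q; $\sum_j \beta_j = 0$
$(\alpha\beta)_{ij}$ = constant for each i, j; $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$
$\varepsilon_{ij}$ ~ NID(0, $\sigma^2$) for each i, j

Here $\alpha_i$ is associated with the Trial factor, $\beta_j$ with the
Experimental factor, $(\alpha\beta)_{ij}$ with the interaction of the two factors,
and $\varepsilon_{ij}$ with all unaccounted for sources of variation. The term
$\mu$ is included to allow for the possibility that the average, across
cells, of the expected value of a cell observation may not be zero.
The use of this model will
certainly raise some serious objections in the reader's mind. The
next few paragraphs will attempt to answer these objections.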

The first objection that may be raised is that there could be a
correlation between the population of the subject’s possible
responses to a particular treatment on a particular trial and the
population of the subject’s possible responses to another particular
treatment on a particular trial. It is well known that a subject’s responses
to a stimulus, even the same stimulus under identical conditions,
will tend to vary randomly about a central response value, due to
such things as random fluctuations in physiological variables over
time, random perceptual oscillations over time, etc. If it is assumed
that these conditions are operative in single subject research, then
the two previously mentioned populations may be assumed to be
independent. The random changes would be carried by the $\varepsilon_{ij}$ term
in the model and the central response values would be carried by
the other terms in the model.

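The second objection that may be raised is that there could be a
correlation between the column of data under one treatment and
the column of data under another treatment, where the population
correlation is defined to be:

$$\rho_{jj'} = \frac{E\left[\sum_{i=1}^{p}(X_{ij} - \bar{X}_{.j})(X_{ij'} - \bar{X}_{.j'})\right]}{\sqrt{E\left[\sum_{i=1}^{p}(X_{ij} - \bar{X}_{.j})^2\right]E\left[\sum_{i=1}^{p}(X_{ij'} - \bar{X}_{.j'})^2\right]}}$$

and the sample correlation is defined to be:

$$r_{jj'} = \frac{\sum_{i=1}^{p}(X_{ij} - \bar{X}_{.j})(X_{ij'} - \bar{X}_{.j'})}{\sqrt{\sum_{i=1}^{p}(X_{ij} - \bar{X}_{.j})^2\sum_{i=1}^{p}(X_{ij'} - \bar{X}_{.j'})^2}}$$

respectively. Under the assumption that the subject's responses
are statistically independent, it can be easily shown that:

$$\rho_{jj'} = \frac{\sum_{i=1}^{p}[\alpha_i + (\alpha\beta)_{ij}][\alpha_i + (\alpha\beta)_{ij'}]}{\sqrt{\left\{\sum_{i=1}^{p}[\alpha_i + (\alpha\beta)_{ij}]^2 + (p - 1)\sigma^2\right\}\left\{\sum_{i=1}^{p}[\alpha_i + (\alpha\beta)_{ij'}]^2 + (p - 1)\sigma^2\right\}}}$$

which, as can be seen, may or may not be zero according to the
pattern of possible non-error effects. The important point to note
is that $\rho_{jj'}$ can be non-zero, even though the subject's responses
are statistically independent, and that its non-zeroness is dependent
solely upon, and thus carried by, the pattern of possible effects due
to Trials and Trials × Treatments.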

The third objection that may be raised is that there could be a 
correlation between the row of data for one trial and the row of data 
for another trial. The situation here is conceptually the same as that 
of the preceding paragraph with i and j interchanged. Thus, any 
correlation would be carried by the pattern of possible effects due 
to Treatments and Trials × Treatments.

A fourth objection that may be raised is the questioning of the
use of a fixed factor for Trials. The trial factor in the proposed
design essentially takes the place of the subject factor in a standard,
repeated measures design. Since it is usually assumed that the group
of subjects in such a design is a random sample, it becomes natural
to consider the subject factor to be a random factor. If the trial
factor could be considered to be a random factor, then, under the
assumptions of this paper, the proposed design could be handled
in the same way as a repeated measures design. Thus, the error term
for testing the main effect of treatments would simply be the mean
square due to the interaction of treatments and trials. The
assumptions behind a random factor do not appear, however, to be
generally feasible for the proposed design. The inherently sequential
nature of the learning process, with which most single subject
research is concerned, tends to cause the pattern of central response
values under a particular treatment to be non-representative of the
pattern that would exist when asymptotic learning has been reached.
To meet the conditions for a fixed factor, it is only necessary to
assume that the pattern of p central response values, associated
with the p trials under a particular treatment, would remain the
same if the experiment were theoretically replicated with the same
subject under identical experimental conditions. The maintenance
of such conditions across replications would, of course, be
impossible in the practical sense. The above assumption required for a
fixed factor appears reasonable to the present authors and the trial
factor may therefore be considered to be fixed. There remains the
problem of constructing an appropriate error term for testing the
effects in the working model. This problem will be dealt with in
the next section.
The Error Term for Testing Effects


The conclusions of the previous section can be summarized by
saying that the proposed design may be viewed conceptually as a
two-way, fixed factor ANOVA with one observation per cell. Since
there is only one observation per cell, the usual within cell estimate
of $\sigma^2$ cannot be used for testing main effect and interaction sources
of variation. It is, therefore, necessary to construct a suitable
estimate of $\sigma^2$. If it can be assumed that $\alpha_i$, the main effect term for
trials, changes rather slowly from one trial to the next, then an
estimate of $\sigma^2$ can be based upon the differences between $\bar{X}_{i.}$ and $\bar{X}_{i+1,.}$
across odd trials. This assumption seems very reasonable for
experiments focusing on learning processes and, of course, could be
applicable to many other types of experimental situations.

The estimate of $\sigma^2$ that the present authors wish to propose,
under the assumption of the preceding paragraph, will be
designated MSE' and is defined in the following formula:

$$MSE' = \frac{\frac{q}{2}\sum_{i=1,3,\ldots}^{p-1}(\bar{X}_{i.} - \bar{X}_{i+1,.})^2}{p/2}, \quad \text{where} \quad \bar{X}_{i.} = \frac{1}{q}\sum_{j=1}^{q}X_{ij},$$

and where it is understood that an even number of trials, p, has been
run. Now, under the assumption that $\alpha_i = \alpha_{i+1}$ for odd i, then
$\bar{X}_{i.} - \bar{X}_{i+1,.} = \bar{\varepsilon}_{i.} - \bar{\varepsilon}_{i+1,.}$, which is distributed independently,
for odd i, and normally with mean zero and variance $(2/q)\sigma^2$. It
follows immediately that $E(MSE') = \sigma^2$ and that $(p/2)MSE'/\sigma^2$
is distributed as a chi-square with p/2 degrees of freedom. Thus,
an approximate F ratio for testing the main effect of conditions would
be MSB/MSE' with (q - 1) and p/2 degrees of freedom, and an
approximate F ratio for testing the interaction of trials and
conditions would be MSAB/MSE' with (p - 1)(q - 1) and p/2 degrees
of freedom. It should be noted that any violation of the assumption
that $\alpha_i = \alpha_{i+1}$ for odd i will tend to inflate MSE' and will therefore
tend to make the above F ratios conservative.

An appropriate test for testing the main effect of trials is the
Mean Square Successive Difference test (Bennett and Franklin,
1961). This test may also be used to test the assumption that
$\alpha_i = \alpha_{i+1}$ for odd i. The formula for the Mean Square Successive
Difference (MSSD) Test as applied to the present design is as
follows:

$$\eta = \frac{\mathrm{MSSD}}{\mathrm{MSA}} = \frac{\sum_{i=1}^{p-1} \left(\bar{X}_{i+1\cdot} - \bar{X}_{i\cdot}\right)^2}{\sum_{i=1}^{p} \left(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}\right)^2},$$

where

$$\mathrm{MSSD} = \frac{q}{p-1} \sum_{i=1}^{p-1} \left(\bar{X}_{i+1\cdot} - \bar{X}_{i\cdot}\right)^2$$

and MSA is the usual mean square associated with the main effect
of trials. Critical values for $\eta$ for 5 per cent and 1 per cent one
tailed tests have been summarized by Bennett and Franklin (1961).
For p > 25 a z-test may be used by employing the following formula:

$$z = (1 - \eta/2)\sqrt{\frac{(p-1)(p+1)}{p-2}}.$$

A significant $\eta$ or z in either direction is evidence for a significant
main effect for trials. A right tailed significant $\eta$ or a left tailed
significant z is evidence for rejecting the assumption that $\alpha_i = \alpha_{i+1}$
for odd i.


Schematic Calculation Procedures 


As before, let A stand for the trial pseudofactor and B stand for
the experimental factor. A schematic ANOVA Source Table for
the proposed design is presented in Table 2.

First, lines A, B, AB, and Total may be filled out, except for the
F and $\eta$ columns, by performing a standard, Two-way Fixed
Factor ANOVA, with one observation per cell, on the data as if
there were no repeated measures. Second, lines SD and E' may be
filled out by applying the formulas given at the base of the Source
Table in Table 2. Third, the slot for $\eta$ and the slots for
the two F ratios may be filled out according to the formulas given
in the body of the Source Table. Fourth, A is considered to be a
significant source of variation if $\eta$ (or z) is significant. Fifth, if $\eta$ is
not significant in the right tail (or z is not significant in the left
tail), then the F ratios for the B and AB sources of variation may
be considered reasonably valid and may therefore be interpreted
TABLE 2
ANOVA Source Table (p even only)

Source   SS      df                 MS      F            eta
A        SSA     p - 1              MSA     -            MSSD/MSA (a)
SD       -       -                  MSSD    -            -
B        SSB     q - 1              MSB     MSB/MSE'     -
AB       SSAB    (p - 1)(q - 1)     MSAB    MSAB/MSE'    -
E'       SSE'    p/2                MSE'    -            -
Total    TSS     pq - 1 (b)         -       -            -

$$\mathrm{MSSD} = \frac{q}{p-1} \sum_{i=1}^{p-1} \left(\bar{X}_{i+1\cdot} - \bar{X}_{i\cdot}\right)^2$$

$$\mathrm{SSE}' = \frac{q}{2} \sum_{i=1,3,\ldots}^{p-1} \left(\bar{X}_{i+1\cdot} - \bar{X}_{i\cdot}\right)^2$$

(a) May be replaced by $z = (1 - \eta/2)\sqrt{(p-1)(p+1)/(p-2)}$, if p > 25.
(b) The total does not include the figures for lines SD and E'.


in the usual manner. If $\eta$ is significant in the right tail (or z in the
left tail), then there is evidence that the assumption $\alpha_i = \alpha_{i+1}$ for
odd i is invalid, and the F ratios should, therefore, be considered
to be invalid. Since, in this case, the F ratios will tend to be
conservative, the individual experimenter may, based upon his individual
situation, decide to accept the F ratios anyway.
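
The five calculation steps above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the authors' own procedure: the function name `single_subject_anova`, the dictionary keys, and the use of NumPy are introduced here for illustration, and the MSSD and SSE' formulas are taken as sums of squared differences between adjacent trial means (means over the q conditions), scaled as the text's expectations require.

```python
import numpy as np


def single_subject_anova(x):
    """Schematic ANOVA for a p-by-q table (rows = trials A, columns = conditions B).

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    x = np.asarray(x, dtype=float)
    p, q = x.shape
    if p % 2 != 0 or p < 4:
        raise ValueError("need an even number of trials, p >= 4")

    grand = x.mean()
    trial_means = x.mean(axis=1)   # one mean per trial
    cond_means = x.mean(axis=0)    # one mean per condition

    # Step 1: standard two-way ANOVA with one observation per cell.
    ssa = q * np.sum((trial_means - grand) ** 2)
    ssb = p * np.sum((cond_means - grand) ** 2)
    tss = np.sum((x - grand) ** 2)
    ssab = tss - ssa - ssb
    msa = ssa / (p - 1)
    msb = ssb / (q - 1)
    msab = ssab / ((p - 1) * (q - 1))

    # Step 2: lines SD and E' from successive differences of trial means.
    diffs = np.diff(trial_means)                     # p - 1 differences
    mssd = q * np.sum(diffs ** 2) / (p - 1)
    sse_prime = (q / 2) * np.sum(diffs[0::2] ** 2)   # pairs (1,2), (3,4), ...
    mse_prime = sse_prime / (p / 2)

    # Step 3: eta, its normal approximation, and the two F ratios.
    eta = mssd / msa
    z = (1 - eta / 2) * np.sqrt((p - 1) * (p + 1) / (p - 2))
    return {
        "MSA": float(msa), "MSB": float(msb), "MSAB": float(msab),
        "MSSD": float(mssd), "MSE_prime": float(mse_prime),
        "eta": float(eta), "z": float(z),
        "F_B": float(msb / mse_prime), "F_AB": float(msab / mse_prime),
        "df_B": (q - 1, p // 2), "df_AB": ((p - 1) * (q - 1), p // 2),
    }
```

Steps four and five (comparing $\eta$ or z, and then the F ratios, with tabled critical values) are left to the reader, since the critical values come from the published tables rather than from the data.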


Summary 


The purpose of the present paper is to show that a one-way
ANOVA may, under certain assumptions, be applied to a single
subject design in which one subject is observed on several trials
under each of several experimental conditions. It is assumed that
the subject may be viewed as a response generator, the responses
of which to a particular stimulus are statistically independent and
normally distributed about a central response value. A fixed, trial
pseudo-factor is introduced to carry any correlations introduced
in the data by using only one subject. It is shown that the one-way
design may be handled as a two-way design, with one observation
per cell, for which, under the assumption that the main effects for
trials change slowly from one trial to the next, a modified error
term for testing effects may be constructed. A statistical test is
given for testing the preceding assumption, and schematic calculation
procedures are presented.

AN EMPIRICAL NOTE ON CORRELATION
COEFFICIENTS CORRECTED FOR 
RESTRICTION IN RANGE 


ROBERT A. FORSYTH 
University of Iowa 


BEHAVIORAL scientists frequently utilize interval estimation techniques
and point estimates of parameters for which there exist only
relatively inadequate sampling theories. Correlation coefficients
corrected for attenuation, for example, have estimated standard error
formulas, but the form of the sampling distribution is not known.
Investigators using such coefficients often employ the sampling theory
for the Pearson product-moment correlation coefficient (Yamamoto,
1965). The inadequacies associated with this procedure have been
discussed recently by Forsyth and Feldt (1969).

A fairly similar situation prevails when an experimenter wishes
to estimate the correlation between variables X and Y for some
population but complete data are available for only a select,
non-random sample of subjects. In a common example, subjects have
been selected on the basis of their X-scores and data for the
Y-variable are available only for this restricted group. This situation
would occur if a college entrance test was utilized as a selection
device and it was desired to estimate the correlation between the
test and freshman grade point average for a population of students
represented by all those who applied for admission. In this case,
data on the test are available for both groups, but grade point data
are available for the restricted group only. A formula is available
which, under certain assumptions, yields an estimate of the correlation
for the complete population. This is sometimes labeled the
correlation coefficient corrected for restriction in range. (Actually
there are several slightly different situations to which the restriction
in range label applies. However, the situation described above is
the most common.) Both Gulliksen (1950, pp. 136-137) and Lord
and Novick (1967, pp. 142-143) have derived the following formula
for estimating the unrestricted correlation between X and Y when
the variance of X is known for both the restricted and unrestricted
groups:


$$r_{xy} = \frac{s_x\, r_{x'y'}}{\sqrt{s_x^2\, r_{x'y'}^2 + s_{x'}^2 - s_{x'}^2\, r_{x'y'}^2}} \qquad (1)$$

where
$s_x^2$ = variance of the unrestricted group.
$s_{x'}^2$ = variance of the restricted group.
$r_{x'y'}$ = correlation between x and y for the restricted group.
$r_{xy}$ = correlation between x and y for the unrestricted group.


In addition to the assumptions of linearity and homoscedasticity,
the population counterparts of the sample estimates must be
employed in the derivation of this equation. If the assumptions are
met and if population facts are known, then Formula (1) will give
the exact value of $\rho_{xy}$, i.e., the population correlation for the
unrestricted population. However, population facts are never known
and hence sample estimates must be utilized in the formula.
Consequently, this raises the issue of sampling error in the estimate, an
issue which can be resolved only by a consideration of the sampling
distribution for the $r_{xy}$ obtained from Equation (1). The form of
this sampling distribution is not known, nor are any standard error
formulas available.

Although many investigators may merely desire point estimates
of these correlation coefficients corrected for restriction in range
(Humphreys, 1968), some researchers (Yamamoto, 1965) may wish
to establish confidence intervals or test hypotheses about the
population value of the corrected r.

The primary purpose of this study was to examine empirical
sampling distributions of corrected r's obtained via Equation (1).
The investigation was not concerned with the appropriateness of
Equation (1) for yielding "good" point estimates of $\rho_{xy}$. Rather, it
was concerned with the accuracy of the nominal confidence
coefficient of the obtained confidence intervals for $\rho_{xy}$ when
Equation (1) was utilized and when it was assumed that these corrected
r's were distributed as Pearson product moment r's. When it was
found that the traditional theory was grossly inadequate, an
attempt was made to develop a more appropriate technique.


Procedures 


Corrected r’s were considered for three sample sizes for the 
restricted group (25, 50, and 100), four cutoff points (Pio, Pss, Paa: 
and Р), and two population values of the correlation (.80 and
.50). Thus, 24 different empirical sampling distributions were gener-
ated utilizing the procedure described below. 

A computer program was written to yield normally distributed
scores X and Y with zero mean and unit variance and predetermined
linear correlation $\rho_{xy}$. For each pair of X and Y values
generated, the X-value was compared to a given cutoff point. If
the X-score was greater than the cutoff point, the pair of scores
was utilized in the analysis. This process was continued until the
desired sample size was attained. Once the sample was completed,
the necessary statistics for the right side of Equation (1) were
calculated and then the equation was solved for $r_{xy}$. This process was
repeated 1000 times to form an empirical sampling distribution of
corrected r's with a specified N for the restricted group, a specified
cutoff point, and a specified population r.
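
One replication of the procedure just described can be sketched in Python. This is a minimal sketch, not the original computer program: the function names `corrected_r` and `one_replication`, the use of NumPy, the expression of the cutoff as a standard score, and the choice of unbiased (n - 1) variance estimates are all assumptions made here for illustration. The correction itself is the standard restriction-in-range formula written in terms of the unrestricted variance, the restricted variance, and the restricted correlation.

```python
import numpy as np


def corrected_r(var_unrestricted, var_restricted, r_restricted):
    """Correlation corrected for restriction in range (explicit selection on X)."""
    num = np.sqrt(var_unrestricted) * r_restricted
    den = np.sqrt(var_unrestricted * r_restricted ** 2
                  + var_restricted * (1.0 - r_restricted ** 2))
    return float(num / den)


def one_replication(rho, n_restricted, cutoff, rng):
    """Generate pairs until n_restricted of them have X above the cutoff.

    `cutoff` is a standard-score cutoff on X (an assumption of this sketch).
    """
    x_all, x_sel, y_sel = [], [], []
    while len(x_sel) < n_restricted:
        x = rng.standard_normal()
        # Y has unit variance and correlation rho with X.
        y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
        x_all.append(x)
        if x > cutoff:
            x_sel.append(x)
            y_sel.append(y)
    var_u = np.var(x_all, ddof=1)   # unrestricted group: size varies by replication
    var_r = np.var(x_sel, ddof=1)   # restricted group: fixed size
    r_res = np.corrcoef(x_sel, y_sel)[0, 1]
    return corrected_r(var_u, var_r, r_res)
```

Calling `one_replication` 1000 times with a seeded generator such as `np.random.default_rng(0)` would give one empirical sampling distribution of the kind described. As in the text, the unrestricted sample size differs from one replication to the next.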

It should be noted that this procedure made the cutoff point a
fixed parameter but allowed the sample size of the unrestricted
group to vary. This meant that the variance estimate of x for the
unrestricted group ($s_x^2$) was based on different sample sizes from
one replication to another. Of course, the sample size of the
restricted group was constant over all replications. (At this time it
should be indicated that several empirical sampling distributions of
corrected r's were obtained using the population value of $\sigma_x^2$ in
Equation (1). The use of $\sigma_x^2$ in this equation would be feasible, for
example, in the entrance test situation described above. In this case
the investigator could use the test norms, under certain
circumstances, to obtain $\sigma_x^2$ instead of obtaining $s_x^2$ from the
unrestricted sample. This, of course, solves the problem associated with
different size samples in the same sampling distribution. The results
of these distributions were very similar to those of the corresponding
distributions using $s_x^2$.)


After the empirical sampling distributions were obtained, the
means and variances were computed for each of the 24 distributions.
Finally, Fisher's z-transformation was utilized to establish
three confidence intervals (.90, .95, .99) for $\rho_{xy}$ for each of the
1000 statistics in a given empirical sampling distribution. The
proportion of these intervals which contained $\rho_{xy}$ was calculated and
compared to the nominal $\gamma$-level. The difference between these
two values was used as an index of the amount of error present
when the traditional Pearson product moment confidence interval
procedure is utilized with restricted r's. These differences were
obtained for all 24 empirical distributions.

Results and Discussion

The means and variances of the 24 empirical sampling distributions
along with the means and variances of corresponding
Pearson product moment sampling distributions are given in Table 1.
It can easily be seen from Table 1 that there is a tendency for
the means of the empirical sampling distributions to be less than
the corresponding means of the Pearson product moment distribution.
However, except for two means (.468 and .444) when
$\rho_{xy}$ = .50 and N = 25 the discrepancy was not too great. As the
sample size increased, the means of the empirical distributions
became very similar to the product moment mean. However, it is
equally apparent from Table 1 that the variances of the empirical
distributions differed markedly from the variances of the corresponding
Pearson r distributions. Furthermore, these discrepancies …
sample size increased. For example, when N = 25
and $\rho_{xy}$ = .50 the ratio of the largest corrected r variance
to the Pearson r variance was 3.83. When N = 50 and $\rho_{xy}$ = .50
the same ratio was 3.52 and when N = 100 and $\rho_{xy}$ = .50 the ratio
… when $\rho_{xy}$ = .80 were 3.01, 2.29, …
There is some indication that as $\rho_{xy}$ increases, the various cutoff
points have less influence on the variability of the sample estimate.
… that the variances of the corrected r's
… as the cutoff point becomes increasingly
higher regardless of … of $\rho_{xy}$.

In general, the data given in Table 1 support the contention
… of Fisher's z-transformation to establish confidence
intervals for $\rho_{xy}$ will prove to be relatively inaccurate. The extent …

TABLE 1
Means and Variances for Sampling Distributions


Mop p nM و‎ == 


Ae Сша Ms V artem 
س‎ 
=" 0 indi pm m 
25 .50 Pe ы un 
35 5% Ps AM ози 
35 5% Pa а м 
25 „д Pa e m 
25 .80 — тм ena 
25 .50 Pu тм r7 
2 .80 Pa ne LE 
35 -50 Pa тю Т 
» .80 Р» P4 p 
50 .60 — E is 
50 .50 Pa P oun 
ә s Р» лм sum 
50 .50 Pu ха ант 
50 .50 Pa К CE 
50 .80 — aT E 
50 .80 Pac . к 
50 ‚80 Pa К кп 
50 .80 Pa 76 хон 
50 .80 Р» лю к 
100 .50 — Ed ‚жа 
100 .50 Р» 06 к 
100 .50 Р» E к 
100 .50 Pu AX 0901 
100 .50 Ра E une 
100 .80 = тю КЫ 
100 .80 Pa Ed uns 
100 .80 Р» me Ed 
100 .80 Pu id - 
100 .80 Ps m 4 
* Values for the means and variances of the Pearson product moment sampling
distributions were taken from Soper et al. (1916, pp. 37-372).


converted to $z_r$-values via the formula $z_r = \tfrac{1}{2}\log_e[(1 + r)/(1 - r)]$.
Confidence intervals for $z_\rho$ were obtained by utilizing the traditional
standard error of these $z_r$-values, namely $1/\sqrt{N - 3}$. The proportion
of these 1000 intervals which enclosed $z_\rho$ was calculated. (Of course,
this is the same proportion of confidence intervals which would
enclose $\rho_{xy}$ if the limits of the interval were converted to correlation
values.) These results are shown as part of Table 2 (in the column
under traditional standard error).

A comparison of the empirical proportion with the nominal
$\gamma$-level provides a basis for determining the accuracy of the
traditional technique. The data in Table 2 furnish very strong evidence
that fairly gross errors will be made if the traditional Pearson r
sampling theory is employed for estimated r's. When the nominal
$\gamma$ = .90, the empirical estimates of $\gamma$ ranged from .616 to .895.
When $\gamma$ = .95, the range was from .703 to .942. Finally, when
$\gamma$ = .99, the estimated $\gamma$'s ranged from .818 to .989. Over all 24
distributions the empirical $\gamma$'s were on the average .11 units below .90,
.09 units below .95 and .05 units below .99.


TABLE 2 
Proportion of Confidence Intervals Enclosing $\rho_{xy}$


Nominal Confidence Interval 


Adjusted tional Adjus 
Standard St 


N 
z. 852 à 
100 . mm i i 
25 .8 Ps S87 (.913) 1988 (.955) 985 
2 ю Ds :895 (919) .942 (062) 980 ( 
» ют .883 — (912) .936 (:961) 989 
» x .837 (.892)* .888 (.953)* 961 : 
2 - .834 (.873) .866 (.936) 953 (.98 
о си .790 (.875) .870 (.935) os € 
» du .859 920 .910 (.952) 968 С 
юйю x (ш юш (m x б 
: Я 958 969 © 
= " Pe J4 ^ (.873)* .801 io so €) 
S EE та. (87) -798  (.925) ‚911 “ШШ 
0 0 Pa т (во) .813 (1920) .907 EE 
50 .80 Pa 818 С) 4883 (.973) 959 # 
S Gu 940 .890 (977) 971 (9 
(dee : (.914) .878 (967) 955 © 
35 Med бе (ёзу .703 ۰ (.929)4 sis b 
т: e2 (8) ‘716 (920) 836 € 
E (.846) -16 (2910) 835 Сш 
а ad 751 (.957) .823 (.978) 924 Dg 
50.8 Ра 751 (947) .837 (069) 936 _ 
" 785 (1931) 1800 (2970) 919 f 


* Standard error for ғ, when Ри is the cutoff point i 
ff point is: 1/./& — aay a | 
5 Standard error for s, when Pa is the eutof point is; 128 — 9- E 
* Standard error for ғ, when Pw is the cutoff point is: = ki 
4 Standard error for zr when Ps is the cutoff point is: 1/ VN 
Three definite trends can be seen for these data (traditional standard
error data): (1) for a given $\rho_{xy}$ and a given cutoff point the
amount of error is relatively independent of sample size; (2) as the
selection procedure becomes more stringent the amount of error
increases; and (3) as $\rho_{xy}$ increases the magnitude of the errors
decreases.


Approximate Confidence Interval Techniques


When the empirical sampling distributions of the corrected r's
were obtained, they were compared to the corresponding Pearson
product moment r distribution (i.e., a distribution with the same
$\rho_{xy}$ and the same sample size). It was very apparent, as Table 1
would indicate, that the corresponding pairs of distributions were
markedly different. However, when the corrected r distributions
were compared to Pearson r distributions for the same $\rho_{xy}$ value but
for a smaller sample size the two distributions became fairly similar.
For example, the corrected r distribution for $\rho_{xy}$ = .50, N = 25, and
cutoff = Ру is more similar to a Pearson r distribution for $\rho_{xy}$ = .50
and N = 15 than it is to the sampling distribution for N = 25.
These facts suggested that it might be possible to find reasonable
confidence intervals by adjusting the N size in the traditional
standard error formula for $z_r$. Of course, as Table 2 indicates, such
adjustment would be directly related to the degree of restriction.
That is, the higher the restriction the greater the decrease in sample
size. However, it is also evident in Table 2 that such a technique
would be somewhat inadequate since the accuracy of the traditional
standard error formula for $z_r$ is related to the value of $\rho_{xy}$.
Since merely adjusting the sample size in the traditional formula
does not provide for varying $\rho_{xy}$ values, any technique using this
simple approach would not be completely adequate. Nevertheless,
it was decided to investigate the effectiveness of adjusting the
traditional standard error formula for $z_r$.

A trial and error procedure was initiated to find these adjusted
standard errors. The adjusted values which seemed to provide
reasonable confidence intervals are given in the footnote to Table 2.
The actual proportions of confidence intervals which enclosed $\rho_{xy}$
when these adjusted values were used are shown in Table 2. There
is little doubt that the adjusted standard errors for $z_r$ provide
much more accurate confidence intervals when compared to the
traditional method. It must be emphasized, however, that these
standard error formulas are merely empirical values which seem to
hold for the 24 distributions under study.


To cross-validate these values, the adjusted standard errors were
examined to see how they would function in three new sampling
distributions (1000 estimates in each). These three distributions are
identified in Table 3. It may be noted that entirely different values
of N, $\rho_{xy}$, and cutoff point were utilized. The adjusted standard
error formulas for these distributions were obtained by a linear
interpolation procedure utilizing the adjusted standard errors
described above.

The adjusted standard error for $z_r$ when Ру is the cutoff point is
$1/\sqrt{N - .30N - 3}$. The adjusted standard error when Р„ is the
cutoff point is $1/\sqrt{N - .45N - 3}$. Therefore, when P.s is the cutoff
point the correction factor is found by solving the following
expression: 3/5(.45N - .30N) + .30N. Thus, when Py is the cutoff
the adjusted standard error is $1/\sqrt{N - .39N - 3}$. The adjusted
standard error when P, was utilized as the cutoff was found by
assuming that the denominator of the standard error formulas must
equal zero when no one is selected (i.e., Po is cutoff point).
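
The interval construction and the interpolation just described can be sketched in Python. This is a minimal sketch under stated assumptions: the function names `fisher_ci` and `interpolated_shrink` and the parameter name `shrink` (the coefficient k in the denominator N - kN - 3) are hypothetical labels introduced here, and the normal quantile is taken from the standard library rather than from printed tables. Setting `shrink=0.0` gives the traditional standard error $1/\sqrt{N - 3}$.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, gamma=0.95, shrink=0.0):
    """Confidence interval for rho via Fisher's z, with SE = 1/sqrt(N - shrink*N - 3)."""
    z_r = math.atanh(r)                       # equals 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - shrink * n - 3.0)
    crit = NormalDist().inv_cdf(0.5 + gamma / 2.0)
    # Convert the limits on the z scale back to correlation values.
    return math.tanh(z_r - crit * se), math.tanh(z_r + crit * se)


def interpolated_shrink(lower, upper, fraction):
    """Linear interpolation between two tabled adjustment coefficients."""
    return lower + fraction * (upper - lower)


# The worked example in the text: 3/5 of the way from .30 to .45 gives .39.
k = interpolated_shrink(0.30, 0.45, 3.0 / 5.0)
traditional = fisher_ci(0.60, 40, gamma=0.90)
adjusted = fisher_ci(0.60, 40, gamma=0.90, shrink=k)
```

Because the adjustment only reduces the effective sample size, the adjusted interval is always wider than the traditional one for the same r and N.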

After the adjusted standard errors were computed, 1000 confidence
intervals for $z_\rho$ were computed in each of the three distributions.
The proportions of these intervals which would enclose $\rho_{xy}$ were found.
These proportions are shown in Table 3. The proportions of confidence
intervals which would enclose $\rho_{xy}$ using $1/\sqrt{N - 3}$ as the standard
error are also presented.


TABLE 3 
Cross Validation Distributions: Proportion of Confidence Intervals Enclosing 
Population Correlation Coefficient 
Nominal Confidence Interval 
7 = 90 part NL 
Adjusted tional tional Adjusted 
Cat. Standard Standard ` Standard Standard Standard $ 
N peo ow Error Error Error Error Error 
40 .60 Р» .877 —— (.776) .934 T .992 
75 .70 Pe 189» (.726) .944 Ce) .985 
40 .65 Pro .904« (.590) .978 (.665) 900 2 


a Estimated standard error of $z_r = 1/\sqrt{N - .39N - 3}$.
b Estimated standard error of $z_r = 1/\sqrt{N - .51N - 3}$.
c Estimated standard error of $z_r = 1/\sqrt{N - .79N - 3}$.
The results shown in Table 3 are very encouraging. The adjusted
standard error technique seems to give reasonably accurate confidence
intervals. Therefore, it is recommended (assuming that the bivariate
distribution does not deviate markedly from bivariate normal) that

A COMPARISON OF COMPUTER-SIMULATED
CONVENTIONAL AND BRANCHING TESTS¹


CARRIE WHERRY WATERS 
Center for Psychological Services 
Ohio University, Athens 
A. G. BAYROFF 


U. S. Army Behavioral Science 
Research Laboratory. 
Arlington, Virginia 


IN the usual testing situation, each examinee takes all of the
items and item sequence is the same for each examinee. It is possible,
however, to have sequential or branching tests in which all examinees
do not take the same items and the sequence of item presentation
for an individual is some function of his performance on
previous items. The rationale for this latter procedure is that the
presentation of items based on an examinee's past performance
allows each individual to take items which are progressively more
appropriate to his own level of ability. It is conceivable that such
a procedure would reduce testing time and for a given amount of
time would permit more accurate measurement of an individual's
ability.

Krathwohl and Huyser (1956) reported the development of a 
6-item per subject branching test for college students which corre- 
lated .78 with a 60-item parent test and .68 with a reading test. 
Two experimental 6-item per subject branching tests for Army 
enlisted personnel (Bayroff, Thomas, and Anderson, 1960; Seeley, 
Morton, and Anderson, 1962) each correlated .63 with a 25-item 
conventional test measuring similar content. Assuming item
intercorrelations of .64, Waters (1964) compared a hypothetical 5-item
per subject branching test with four hypothetical 5-item conventional
tests, which differed in item difficulty distributions, and
found the branching test correlated slightly higher with an underlying
ability criterion than did any of the conventional tests regardless
of whether free response or multiple choice format was assumed.

¹ The opinions expressed in this paper are those of the authors and do not
necessarily reflect official Department of the Army policy.


Tests 
Conventional Tests 


Five-, ten-, and fifteen-item hypothetical conventional (C) tests
were evaluated. All tests were symmetric around p = .50, but
varied in item difficulty distributions. The distributions investigated
were all items at p = .50 (C50), roughly normal (CN), or rectilinear
(CR). Each of the CN and CR tests was tried out with
difficulty ranges of .30 through .70 and .10 through .90. Table 1
gives the C50, CN, and CR item difficulty distributions for the
five-, ten-, and fifteen-item conventional tests.


Branching Tests 


One-Item Per Stage: Six hypothetical 1-item per stage (1-PS)
branching tests were evaluated. Two tests were studied at each of
the three test lengths (5, 10, and 15 items). One of the two tests
covered a difficulty range of .30 through .70 and the other ranged


TABLE 1 
Number of Items at Each p-Value for Conventional Tests 


5-Item Tests 


10-Item Tests 


30-70 .10-.90 .30—.70 .10-.90 
Range Range Range Range 
all 


N 


R 


we € Q о "мү 


* No 5-item normally distributed tests were evaluated, 
.10 through .90. The 5-item per subject branching tests contained
15 items with each examinee responding to only five of the items.
In the 10-item per subject tests, each examinee took 10 of the 55
items in the test. The 15-item per subject tests were composed of
120 items. In each of the six tests, the first item (p = .50) was the
same for all examinees, but the remaining items taken were determined
by the examinee's performance on the immediately preceding
item. If an examinee passed an item he proceeded to a more
difficult one; if he failed an item he proceeded to an easier one.
When the range of p-values in a test was .30 through .70, increases
and decreases in difficulty between adjacent items were in steps of
.05 for the 5-item per subject test, .0222 for the 10-item per subject
test, and .0143 for the 15-item per subject test. For the .10 through
.90 range tests, the steps were .10 for the 5-item per subject test,
.0444 for the 10-item per subject test, and .0286 for the 15-item
per subject test. Figure 1 shows the possible branching paths and
the scores assigned the terminal points for a 5-item per subject,
one-item per stage (p = .10-.90) branching test. Scoring and
branching for the other one-item per stage tests were done in the
same manner.
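
The step sizes and item-pool sizes quoted above follow from simple arithmetic, which can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, and it assumes (as the text states) that every test starts at p = .50 and that the difficulty range is symmetric about .50, so that an examinee who passes every item reaches the lower end of the range on the last item.

```python
def step_size(p_low, p_high, items_per_subject):
    """Difficulty step between adjacent items in a 1-item-per-stage test."""
    half_range = (p_high - p_low) / 2.0
    return half_range / (items_per_subject - 1)


def pool_size(items_per_subject):
    """Total items needed: stage k offers k alternative difficulty levels."""
    return items_per_subject * (items_per_subject + 1) // 2


def administered_p_values(responses, p_low, p_high):
    """p-values of the items actually taken, given pass (True) / fail (False)."""
    step = step_size(p_low, p_high, len(responses))
    p = 0.50
    taken = []
    for passed in responses:
        taken.append(round(p, 4))
        # A pass leads to a harder item (lower p), a failure to an easier one.
        p = p - step if passed else p + step
    return taken
```

For example, `administered_p_values([True] * 5, 0.10, 0.90)` traces the path of an examinee who passes all five items in the .10 through .90 test.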

Two-Item Per Stage: Four hypothetical 2-item per stage (2-PS),
10-item per subject branching tests were evaluated. Each of these
tests was composed of 114 items. At each stage in these tests the
examinees took two items of the same difficulty level. The first two
items taken by all examinees had p-values of .50. If the examinee
passed both items in a pair he branched to a more difficult item
pair; if he passed one of the items in a pair he branched to a pair of
equal difficulty; if he failed both items in a pair he proceeded to an
easier pair of items. Items for two of the tests covered a difficulty
range of .30 through .70, while the other two tests ranged .10
through .90. For each of these difficulty ranges, one branching test
was developed by having equally spaced item pairs in the terminal
row of the test (2-PS-E). The p-values of the item pairs in
the other rows were determined from the terminal item pairs.
For the other 2-item per stage tests, 2-PS-U (one for each of the
item difficulty ranges), the item pair p-values were determined by
branching downward from the p = .50 item pair to the terminal
row of item pairs. Using this procedure, the item pairs in the
terminal rows were not equally spaced as in the 2-PS-E tests, but
were spaced so that the intervals between item pairs were smaller
in the middle part of the difficulty ranges, and larger nearer the
extreme difficulty values. Scores for all four of the two-item per
stage tests ranged from 0 to 62.

Figure 1. Five-item per subject, one-item per stage branching test.


Computational Procedures and Assumptions 


Statistical computations were based on a theoretical model presented
by Lord (1952). The model assumes that there is a trait of
ability underlying the raw scores on a test, and that the probability
of an examinee's responding correctly to a test item is a normal
ogive function of his position on the ability dimension. Since item
responses are a function only of scores on the ability continuum,
they are independent of each other when ability is held constant.
When all of the items in a test are assumed to have the same
biserial ($R_i$) with ability, $R_i^2$ is an estimate of item intercorrelation.
Three major steps are involved in obtaining the correlation between
test score and underlying ability: the proportion of examinees
passing each item is determined for each of the ability levels under
consideration; the conditional distribution of test scores is obtained
for each ability level; and the bivariate frequency distribution of
test score and ability is obtained.

Proportion of Examinees at a Given Level of Ability
Who Pass an Item

When the group tested is assumed to be normally distributed on
ability, Lord's formulas (9) and (10) may be used to find the
proportion of examinees who pass each of the test items when
ability is held constant. In Lord's notation, a value of $g_i$ (the z
score corresponding to the P-value of item i at a specified ability
level) is computed for each ability level under consideration by
formula (9):

$$g_i = \frac{h_i - R_i c}{K_i}, \quad \text{where}$$

$h_i$ = the z value corresponding to the population p-value of item i
$R_i$ = the biserial correlation between item i and underlying ability
$c$ = the z score representing the ability level being considered
$K_i = \sqrt{1 - R_i^2}$

Each $g_i$ is converted to $P_i$ (P-value of item i for examinees at a
given ability level) by Lord's formula (10):

$$P_i = \frac{1}{\sqrt{2\pi}} \int_{g_i}^{\infty} e^{-t^2/2}\, dt = \text{area of the normal curve above } g_i.$$

$P_i$ values are computed for all ability levels.


Conditional Test Score Distribution for Given
Ability Levels

For conventional tests, the distribution of test scores at each of
the specified ability levels may be computed by expansion of
Lord's formula (11):

$$\prod_{i=1}^{n} (P_i + Q_i), \quad \text{where}$$

$\prod_{i=1}^{n}$ indicates the successive multiplication of the $(P_i + Q_i)$ terms
$n$ = number of items in test
$P_i$ = proportion passing item i for the given ability level
$Q_i = 1 - P_i$

Terms of this expansion give all possible ways of obtaining various
test scores. Those terms which lead to the same test score are
summed to obtain the distribution of test scores for a given ability
level.

Although Lord does not discuss branching tests, his model is also
applicable to this type of test. For a branching test the proportion
of examinees (at a specified ability level) following any path may
be determined by multiplying the $P_i$ or $Q_i$ values (as obtained by
Lord's formulas 9 and 10) of the items which make up that path.
If an item is passed its $P_i$ value is used; if an item is failed its $Q_i$
value is used. Such a proportion is computed for each path and
values for paths leading to the same test score are summed to
obtain the distribution of test scores for that ability level.

Bivariate Frequency Distribution of Test Score
and Ability


For both conventional and branching tests, the bivariate distribution
of test score and ability is obtained by multiplying the conditional
test score distribution for each ability level by the ordinate
value of the normal curve at that ability level (Lord's formula 14,
applicable when a normal distribution of ability is assumed). The
test-ability correlation coefficient may be computed from this scatter
diagram.

A FORTRAN program which performs these computations was
written for the GE 225 computer by Mr. Sidney Sachs of the Computer
Applications Branch, U. S. Army Behavioral Science Research
Laboratory. This program was used to obtain the test-ability
coefficients reported in this study. It should be noted that Brogden
(1946), Tucker (1946), and Lord (1952) have provided computationally
easier formulas for obtaining the test-ability coefficients
for conventional tests.

In this study, the distribution of underlying ability in the
theoretical sample of examinees was assumed to be normal with $\bar{X}$ = 0
and $\sigma$ = 1.00. Twenty-nine levels of ability, measured in standard
scores ranging from +3.5 to -3.5 in steps of .25, were used. The
biserial correlation between an item and ability was constant for
all items in a given test. For each of the 5- and 10-item per subject
tests, the biserial was varied from .30 to .90 in steps of .10. The
15-item tests were evaluated at biserials of .40, .60 and .80.
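
The three computational steps can be put together for a conventional test in a short Python sketch. This is an illustration under stated assumptions rather than a reconstruction of the FORTRAN program: the function names are hypothetical, the pass probability is taken as the normal-curve area above $(h_i - R_i c)/K_i$, and the correlation is computed from the discrete bivariate table formed at the 29 ability levels, each weighted by the normal ordinate.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def pass_probability(p_value, biserial, ability):
    """Step 1: proportion at a given ability level passing an item."""
    h = _NORMAL.inv_cdf(1.0 - p_value)
    k = math.sqrt(1.0 - biserial ** 2)
    return 1.0 - _NORMAL.cdf((h - biserial * ability) / k)


def score_distribution(pass_probs):
    """Step 2: expand the product of (P_i + Q_i) terms, collecting equal scores."""
    dist = [1.0]
    for p in pass_probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)     # item failed
            new[score + 1] += mass * p         # item passed
        dist = new
    return dist


def score_ability_correlation(p_values, biserial):
    """Step 3: weight by normal ordinates and correlate score with ability."""
    levels = [-3.5 + 0.25 * k for k in range(29)]
    weights = [_NORMAL.pdf(c) for c in levels]
    total = sum(weights)
    s_c = s_x = s_cc = s_xx = s_cx = 0.0
    for c, w in zip(levels, weights):
        probs = [pass_probability(p, biserial, c) for p in p_values]
        for score, mass in enumerate(score_distribution(probs)):
            f = w * mass / total               # cell of the bivariate table
            s_c += f * c
            s_x += f * score
            s_cc += f * c * c
            s_xx += f * score * score
            s_cx += f * c * score
    cov = s_cx - s_c * s_x
    return cov / math.sqrt((s_cc - s_c ** 2) * (s_xx - s_x ** 2))


r_c50_five = score_ability_correlation([0.50] * 5, 0.50)
```

Under these assumptions the value for the 5-item test with all items at p = .50 and a biserial of .50 should come out close to the .696 shown for that test in Table 2.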


Results and Discussion 


Conventional Tests. The correlation coefficients between test score
and ability for the 5-item conventional tests are shown in the first
three rows of Table 2. For biserials of .30 through .70 ($r_{ij}$ = .09
… the moderate range (.3 through .7) and eventually the wider
range (.1 through .9) tests were best for higher intercorrelations.


TABLE 2 
Test Score-Ability Correlation Coefficients for 5-Item Per Subject
Conventional and Branching Tests 

Biserial .30 .40 .50 .60 .70 .80 .90
Test 

С (all .50) 4s» 61 696 760 $823 858 871 

€ (3-7, R) 473 51 66 762 819 m -4 

C (1-9, R) 44 59 66 7% 1793 Ж 

LB (3-7, 1-PS) 478 59 694 374 8S5 a 

В (.1-9, 1-PS) 461 580 680 760 826 


a : i = 90. 
through .9 range test had the higher coefficient.
Comparison of Conventional and Branching Tests. One of the




r_bis ≥ .60 (r_ij ≥ .36). At the higher biserials, .70 through .90,
both of the branching tests yielded higher coefficients than did any
of the conventional tests. For the lower biserials, .30 through .50,
the C50 conventional tests resulted in slightly higher coefficients 
than did either of the branching tests. 


Ten-Item Tests 


Conventional Tests. The test score-ability correlation coefficients 
for the 10-item conventional tests are shown in the first five rows of 
Table 3. The data showed that the C50 test had the highest coefficient
for each biserial through .60. For these same biserials, all
of the .3 through .7 range tests were next highest and the .1 through
.9 range tests were lowest. At r_bis = .70, the C50 and .3 through .7
tests were about equally effective, and yielded higher coefficients 
than the .1 through .9 tests. At biserials of .80 and .90, the original 
situation was reversed and the C50 test had the lowest coefficients 
and the .1 through .9 tests the highest coefficients. 


TABLE 3 


Test Score-Ability Correlation Coefficients for 10-Item Per Subject 
Conventional and Branching Tests 


Biserial .30 .40 .50 .60 .70 .80 .90
Test 

C (all .5) 614^ 728 807 859 891 905 898 
C (3-3, N) eos 723 802 856 890 909 910 
C (3-7, В) 604 719 799 854 890 911 917 
С (1-9, N) 586 702 786 844 886 913 929 
C (1-9, R) 563 60 767 830 877 913 941 
B (3-7, 1-PS) 612 728 808 866 904 926 931 
B (3—7, 2-PS-E) 500 мо 737 800 863 898 915 
B (3—7, 2-PS-U) 512 633 721 799 851 885 м 
B (1—9, 1-PS) 601 по SOI 862 005 934 99 
B (1-9, 2-PS-E) 531 655 751 825 881 921 
B (1-9, 2-P&-U) B19 640 729 808 8502 80 918 


a Decimal points omitted.


Branching Tests. The 10-item per subject branching test data are
given in the bottom six rows of Table 3. The .3 through .7 1-PS
tests tended to correlate higher with the criterion than did the .1
through .9 tests through a biserial of .60. Above this level the converse
held. It should be noted that for all biserials, and any given
item difficulty range, the 1-PS branching test correlated higher than
any 2-PS test covering the same range. In fact, with only one exception
(r_bis = .90), both the 1-PS .1 through .9 and .3 through
.7 tests yielded higher coefficients than did any of the 2-PS tests.
The 2-PS-E tests correlated higher with the ability criterion than
did the 2-PS-U tests.

Comparison of Conventional and Branching Tests. One of the
1-PS branching tests was superior to any of the conventional tests
at biserials above .40 (the .3 through .7 1-PS was highest at
r_bis = .50 and .60; the .1 through .9 1-PS was highest at r_bis = .70
through .90). At a biserial of .30 the C50 test coefficient was slightly
higher and at r_bis = .40 the C50 conventional and .3 through .7
1-PS branching tests had the largest coefficients. The 2-PS branching
tests compared favorably with the best conventional test only
at very high biserials.


Fifteen-Item Tests 


Conventional. All 15-item tests were evaluated at biserials of
.40, .60 and .80. The test score-ability correlation coefficients for the
five conventional tests are given in the first five rows of Table 4.


TABLE 4 
Test Score-Ability Correlation Coefficients for 15-Item Per Subject 
Conventional and Branching Tests 
Biserial .40 .60 .80
Test 
C (all .50) 792 896 923 
2€ (3-7, N) 787 804 923 
С (3-7, В) 785 894 930 
С (14-9, №) 164 884 936 
` € (1-9, R) 151 877 pei 
В (3-7, 1-PS) 793 903 s 
B (1-9, 1-PS) 786 902 9 


These data showed that the C50 test had the highest coefficient at
biserials of .40 and .60. At a biserial of .80, the .1 through .9 tests
(both N and R) did best. A comparison of the .3 through .7 and .1
through .9 tests across the three biserials showed that the narrower
range tests received higher coefficients at the lower biserials (.40
and .60) and the wider range tests did better for the high biserial
(.80). This general trend was consistent with the results obtained
for the 5- and 10-item conventional tests. For tests of a given
range, those with approximately normally distributed item difficulties
were superior to those with rectilinear difficulty distributions
at the .40 biserial and did less well than their rectilinear counterparts
at a biserial of .80. At r_bis = .60, no difference was obtained
between the .3 through .7 N and R tests, but the .1 through .9 N 
test was superior to the R test of the same range. This same trend 
was also found for the 10-item conventional tests. In general, as 
biserials (and thus item intercorrelations) increased, wider range 
tests and tests with more rectilinear item difficulty distributions did 
progressively better. 

Branching. Data for the two 15-item branching tests are given in 
the last two rows of Table 4. The .3 through .7 test correlated
higher with the ability criterion at the .40 biserial, while the .1
through .9 test yielded the higher coefficient at r_bis = .80. The two
branching tests were essentially equivalent at the .60 biserial. 

Comparison of Conventional and Branching Tests. Both of the 
branching tests yielded higher coefficients than did any of the con- 
ventional tests for biserials of .60 and .80. At the .40 biserial, the 
C50 test was essentially equivalent to the .3 through .7 branching 
test. 


Effects of Test Length 


Table 5 gives the increments in test score-ability coefficients as
the tests were increased in length from five to fifteen items. Increasing
the number of items from five to ten resulted in increments
in the correlations which were about twice as large as those obtained
by increasing test length from 10 to 15 items. Increases in test
length produced larger gains in the test score-ability coefficients at
the lower biserial values. There appeared to be little difference between
conventional and branching tests in terms of the effects of increasing
test length.


Overview 
Both conventional and branching test data showed that tests with 
the least spread of item difficulties yielded the highest coefficients 
when low to moderate biserials were assumed. For medium to high 


biserials, the moderate range and wide range tests tended to yield 
coefficients of about the same magnitude. The wide range tests gen-


TABLE 5


Increment in Test Score-Ability Correlation Coefficients
with Increase in Test Length


Biserial .30 .40 .50 .60 .70 .80 .90
C (all .50) 5-10 132 127 11 090 068 047 07 
10-15 064 037 018 
C (3-7, N) 5-10 
10-15 064 038 019 
С (1-9, N) 5-10 
10-15 062 040 023 
C (3-7, R) 5-10 131 128 113 092 071 050 030 
10-15 066 040 019 
C (1-9, R) 5-10 129 131 121 104 084 063 O04 
10-15 071 047 024 
В (3-7, 1-PS) 5-10 134 129 114 092 069 O46 025 
10-15 065 - 037 017 
B (.1-.9, 1-PS) 5-10 137 133 115 094 071 043 025 
10-15 066 039 018 


* Decimal points omitted. 


erally did best when very high biserials were assumed. These data 
are consistent with the 9- and 18-item test data reported by Brog- 
den (1946). The shift in the relative effectiveness of the narrower 
and wider range tests tended to take place earlier when test length 
was increased. 

For the lowest biserial assumed (.30), the C50 test was the only 
conventional test which correlated higher with the ability criterion 
than did the best branching test. At biserials of .40 and .50, the 10-
and 15-item branching tests covering a .3 to .7 range and the C50
test were essentially equivalent. For biserials of .60 and above, one 
of the branching tests always did better than any of the conven- 
tional tests. 

A comparison of 1-item per stage and 2-item per stage branching
tests (at the 10-item test length) indicated that the 1-item per
stage tests had uniformly higher coefficients than did the 2-item
per stage tests of the same range. In view of these results, it would
not seem profitable to use the more complex 2-item per stage
structure in the development of branching tests for the purpose of
maximizing overall correlation.


MINIMIZING ORDER EFFECTS IN THE SEMANTIC
DIFFERENTIAL


ROBERT B. KANE
Purdue University


Houston (1967) has devised a theoretical solution to the problem
of order effects on responses in tests or questionnaires of k
items, for k < 22, but there is no reasonable way to utilize these
results if ordinary duplicating equipment is used to prepare the
materials. Kane (1969) provided a computer program that generates
semantic differential (SD) questionnaires and takes account
of all three sources of presentation order effects in an SD: (1)
concept presentation order; (2) adjective scale order; and (3) the
polarity of each scale (which end is positive). Bias due to concept
order and scale order is minimized by using the particular permutation
of k items found by Houston to yield the minimum order
effect. Scale polarity is determined scale-by-scale by reference to a
pseudo-random digit generating function. The program was written
so that E may invoke or ignore subroutines designed to minimize
each type of order effect. For example, E may minimize bias
due to scale ordering while holding concept order and scale polarity
invariant.


producible from the computer program plus the standard noncomputer-
based format, were considered.

Of the 36 pairs of strategies derivable from these nine, three were
compared to determine the utility of reducing order effects when
employing the SD. The study of each pair is designated as an experiment.

¹ The work reported herein was performed pursuant to a grant from the
Office of Education, U. S. Department of Health, Education, and Welfare.


Experiment I: Concept order, scale order, and scale polarity fixed
vs. all three varied.

These strategies should produce maximum differences with re- 
spect to order effects. 

Experiment II: A few concept orders, scale order and polarity 
fixed vs. all three varied.

These strategies should produce differences comparable to those 
between SD questionnaires produced in the standard (non- 
computer-based) way and those produced by employing all the 
format variability available from the computer. 

Experiment III: A few concept orders, scale order and polarity 
fixed vs. concept and scale order fixed, polarity varied. 

This provides a comparison between the noncomputerized ques- 
tionnaire and one in which only scale polarity is varied. If 
significant differences in response patterns are found within 
Experiments II and III, and if the differences in Experiment II 
and III are comparable then it would be economically sound 
to generate SD questionnaires varying only scale polarity since 
it is simpler (thus less costly) than varying all three orders. 

One hundred fifty undergraduate elementary education majors 
were selected randomly as Ss. Fifty Ss were assigned randomly to 
each experiment. Within each experiment 25 Ss were assigned 
randomly to each treatment. Ten days after the first data collection
each S completed another SD composed of the same concepts
and adjective scales, but generated by the opposing strategy. 

Each SD questionnaire was composed of nine concepts related
to major curricular areas in the elementary schools, each to be rated
on 14 adjectival scales.


Findings and Analysis 


Fifty-four (two treatments × 3 experiments × 9 concepts)
14 × 14 matrices of product-moment correlations were computed.
Each was factored and rotated to the Varimax criterion (Kaiser
1958, 1960). Since the proportion of total variance accounted for
by the first two factors ranged from 0.45 to 0.82, and since factor
III contributed more than 10 per cent of the total variance in only
one case, only factors I and II were used as data sources for this
study.

Three comparisons between responses to the two types of SD 
questionnaires were analyzed in each experiment: 

1. Differences in rotated factor structure. 

2. Differences in factor scores concept-by-concept. 

3. Differences in response consistency. 


Factor Structure 


Scales with factor loadings ≥ 0.30 were listed for factors I and
II for each of the 54 rotated factor matrices. In the case of factor
I these data then were compressed by recording only those scales
with loadings ≥ 0.30 for eight concepts out of nine. In the case of
factor II the criterion for final recording of a scale was set at loadings
≥ 0.30 for seven concepts out of nine. Tables 1 and 2 list
the scales which survived these screenings.

In Table 1 there are 56 (i.e., 4 strategies × 14 scales) cells in
which a tally mark can appear. By changing the entry in just six
of these cells identical matchings could be created in all four

TABLE 1 
Scales With Factor I Loadings ≥ 0.30 In At Least 45 Out of 54 Cases
Questionnaire Generating Strategy
    1           2           3           4
                           Only        Only
   All         All        Concept     Scale
Orderings   Orderings      Order     Polarity
  Fixed       Varied      Varied      Varied

heavy-light 
active-passive x 4 x 
happy-sads x = х 
heavenly-hellish x x 
ast-slow X x x 
Positive-negative* x 2 
difficult-easy x x 
optimistic-pessimistic x x x 
Strong-weake x 8 
hard-soft x 
nice-awfuls x B x 
hot-cold x x 
good-bads x » * 
maseuline-feminine 


a Denotes scales chosen to represent factor I in factor score study.

TABLE 2
Scales With Factor II Loadings ≥ 0.30 In At Least 36 Out of 54 Cases


Questionnaire Generating Strategy


    1           2           3           4
                           Only        Only
   All         All        Concept     Scale
Orderings   Orderings      Order     Polarity
  Fixed       Varied      Varied      Varied
beavy-light* x х х х 
happy-sad 
heavenly-hellish x 
fast-slow 
positive-negative 
difficult-easy* 2 x x x 
optimistic-pessimistio 
hard-soft* 
x > 
nice-awful x Y F 5 
hot-cold 
good-bad 
masculine-feminine 


* Denotes scales chosen to represent factor II in factor score study. 


strategy columns. Indeed, identical markings already exist for nine
of the 14 scales. Although the strategy two column exhibits the
greatest deviation from the other columns, these data indicate
similarities among the columns.

By changing only two entries out of 56 in Table 2, matchings in
all four strategy columns could be created. With the possible exception
of factor I, strategy 2, there seem to be no appreciable differences
in factor structure among the four questionnaire generating
strategies for either factor I or factor II.

Factor Scores 


Five scales were chosen to represent factor I and three scales 
were chosen to represent factor II. 

A score from 0 to 6 was recorded for each S on each scale and
mean scores of factor I scales as well as factor II scales were computed
concept-by-concept within each experimental treatment.
Thus within each experiment there were nine pairs of mean scores
for factor I and nine pairs for factor II. The difference between mean
scores within each pair was analyzed by an analysis of variance model.
Table 3 lists the F ratios emanating from experiments I, II, and
III respectively.


TABLE 3 

F Ratios for Factor Score Study

                                  Experiment I          Experiment II         Experiment III
Concept                           Factor I   Factor II  Factor I   Factor II  Factor I   Factor II

Language Arts                     1.810      0.844      0.258      0.067      0.064      0.418
Mathematics                       1.881      1.715      0.508      0.197      0.039      0.146
Science                           0.051      0.717      0.646      1.449      0.602      0.054
Social Studies                    2.843*     0.524      1.114      0.829      0.050      0.336
Teaching Children                 0.746      1.246      2.173      2.767*     0.008      0.803
Teaching Children Language Arts   0.346      0.280      0.580      1.294      0.330      1.200
Teaching Children Mathematics     0.945      0.000      0.256      0.375      0.788      2.863*
Teaching Children Science         0.026      0.006      0.007      0.401      0.002      1.126
Teaching Children Social Studies  0.877      0.000      0.008      0.181      0.792      1.774


* Significant at α = 0.10. None of these F ratios is significant at α = 0.05.


Of the 54 F ratios displayed in Table 3, none is significant at
the α = 0.05 level; only three are significant at the α = 0.10 level.
In fact only six more are significant when the α level is advanced to
0.25. Forty of the 54 F ratios are less than 1.000. These data indicate
that no systematic differences in factor scores occur in any of
the experiments. 


Response Consistency 

As a final reading of the differences between strategies a measure
designated "response consistency" was devised. This measure seeks
to answer the following question: How closely does S's response
on the (n + k)th adjective scale conform to his response on the
nth adjective scale? To answer this question the absolute value of
the difference between the score on scale n and scale n + k was
selected as the measure. Thus

$$C = \frac{\sum |s_n - s_{n+k}|}{14 - k}$$

where C denotes a response consistency index, s_n denotes the score
on scale n, s_{n+k} denotes the score on scale (n + k), and
|s_n − s_{n+k}| is summed over all such differences within a given concept.
Clearly, the summation could be made of all such differences produced
by a given S across concepts if one wished to do so. Summing
within concepts and across Ss was done to conform with the
other analyses made in this study. It was decided to let k = 1, 2, 3,
or 4. By using 14 − k in the denominator, response consistency
measures for k = 1, 2, 3, and 4 were transformed into comparable
indices.
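The index just defined can be sketched in a few lines. This is an illustration under stated assumptions, not the author's program: it computes the index for a single S's ratings of a single concept, and the function name `response_consistency` and the sample ratings are hypothetical. The denominator is the number of scale pairs available at lag k, which equals 14 − k for the 14 scales used in the study.

```python
def response_consistency(scores, k):
    """Mean absolute difference between ratings on scales n and n + k.

    `scores` holds one S's ratings (0 to 6) of one concept, in the order
    the scales were presented; `k` is the lag between the scales compared.
    """
    n_scales = len(scores)
    if not 1 <= k < n_scales:
        raise ValueError("k must be at least 1 and less than the number of scales")
    diffs = [abs(scores[n] - scores[n + k]) for n in range(n_scales - k)]
    return sum(diffs) / (n_scales - k)  # 14 - k when there are 14 scales


if __name__ == "__main__":
    ratings = [0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 1]  # hypothetical ratings on 14 scales
    for lag in (1, 2, 3, 4):
        print(lag, response_consistency(ratings, lag))
```

Dividing by the number of pairs is what makes the indices for k = 1, 2, 3, and 4 comparable, since a larger lag leaves fewer pairs to sum over.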

If order effects are salient then differences when k = 1 should
be less than differences when k = 2 and, in general, C_1 < C_2 < C_3
< C_4. Table 4 displays response consistency indices for k = 1, 2, 3,
and 4 for each concept from 200 of the SD questionnaires completed
by Ss in this study.

An inspection of the nine columns of Table 4 does not support
the existence of the order relation

C_1 < C_2 < C_3 < C_4.

In order to determine whether row or columnar differences are significant
a two way analysis of variance was performed. Table 5
includes the relevant data.

The inequality of the indices across concepts suggests that the 
magnitude of the response consistency indices is related to the 
concept being rated. While the F ratio associated with within-
column differences suggests that C_1 ≠ C_2 ≠ C_3 ≠ C_4, no systematic
order relation among C_1, C_2, C_3, and C_4 was observed. There
is no evidence of concept-index interaction. Thus while there are
differences in response consistency these differences do not appear
to be interpretable as indicators of order effects.

This research supplied no evidence that users of the SD need to 
be concerned about item order effects as a significant source of er- 
ror variance. In Experiment I, where one treatment invited maxi- 
mum order effects and the other treatment minimized the sources 
of these effects no significant response differences were observed. 
In Experiments II and III, where the opposing SD formats were 
less profoundly different, the same result obtained. 

Subject to the usual constraints on the generalizableness of findings
it appears that Es may cease worrying about the effects of a
constant item presentation order when administering the SD.

TABLE 5
A Comparison of Four Response Consistency Indices Across Nine
SD Concepts for 200 SD Questionnaires

                                 Sum of      Degrees of   Mean
Source of Variation              Squares     Freedom      Square    F

Response Consistency               14.12          3         4.71    10.5
  Indices for k = 1, 2, 3, and 4
Concepts                          116.58          8        14.57    32.4*
Interaction                        11.01         24         0.46     1.0
Within Cells                     3251.88       7164         0.45

* Significant at α = 0.01.


BEHAVIORAL COGNITION AS RELATED TO
INTERPERSONAL PERCEPTION AND SOME
PERSONALITY TRAITS OF COLLEGE STUDENTS¹


C. M. N. MEHROTRA
The Ohio State University²


The Structure of Intellect (SI) Model proposed by Guilford
(1967) classifies the intellect into operations which it can perform,
different contents of these operations, and different products. The
model hypothesizes four kinds of content: semantic, symbolic, figural,
and behavioral. The first three content areas are not unfamiliar
to psychologists, as they were included previously in other
models, for example, Eysenck (1953), and are seen in most of the 
existing test batteries. The fourth content area, the behavioral one, 
has been added to the model “to take care of operations pertaining 
to the behavior of other people” (Guilford, 1967). This area in- 
cludes feelings, motives, thoughts, intentions or other psychological 
dispositions which might affect an individual’s social behavior. 
O'Sullivan, Guilford, and de Mille (1965) have developed measures 
for six different factors of behavioral cognition hypothesized in the 
SI model and have shown that these abilities are factorially distinct 
from previously isolated intellectual abilities. But there is no re- 
search evidence to show whether the behavioral cognition factors 
are different from the related factors in the nonintellectual domain. 
The present study was conducted to achieve this goal. To be more
specific, the aim was to determine the nature of the relationship
of behavioral cognition factors with interests (social service, persuasive,
artistic, and literary), values (social, political, and aesthetic)
and personality variables (inclusion, control, affection,
extraversion-introversion, sensing-intuition, judgment-perception, and
thinking-feeling). It was hypothesized that if the behavioral cognition
measures developed by O'Sullivan et al. (1965) have discriminant
validity they will have low relationship with these variables.

¹ This paper is based on a dissertation submitted to The Ohio State University
in partial fulfillment of the requirements for the Ph.D. degree. The
author wishes to thank John E. Horrocks, Chairman of his Committee, George


Method 
Subjects 


The subjects were 100 male and 100 female college undergraduates
enrolled in an introductory course in educational psychology at
The Ohio State University.


Psychological Measures 


The tests used in the present study were of behavioral content 
involving the operation of cognition and the products of classes, 
systems, transformations, and implications. Thus the following 
tests were selected: Expression Grouping for cognition of behav- 
ioral classes (CBC), Missing Cartoons for cognition of behavioral 
systems (CBS), Social Translations for cognition of behavioral 
transformation (CBT), and Cartoon Predictions for cognition of 
behavioral implications (CBI). A description of the four factors 
and their tests can be found in Hoepfner and O'Sullivan (1968). 
For assessment of interpersonal perception six filmed interviews
(Cline and Richards, 1960) were used. Three of these interviewees
were male and three were female. After viewing each film the Ss
were required to fill out the three judging instruments: CPI Opinion
Prediction Test (OPT), Adjective Checklist (ACL) and Behavior
Postdiction Test (BPT). The variables in the nonintellectual
domain were assessed by self-descriptive, objectively scored
inventories. As far as possible two instruments were employed for
measuring the variables in one area. Personality variables were
assessed by using the FIRO-B (Schutz, 1967) and Myers-Briggs
Type Indicator (Myers, 1962), interests by the Kuder Preference
Record (Kuder, 1942) and values by the Study of Values (Allport,
Vernon and Lindzey, 1960). Rationale for choosing each of these
instruments is given in Mehrotra (1968).

Correlations were computed to separately examine the relationship
between each of the behavioral cognition scores and the interpersonal
perception, personality, interests, and values scores.
Multiple regression analysis was used to see how far it is possible
to predict the performance on behavioral cognition tests by using
the weighted sum of scores on measures of interpersonal perception,
personality, interests and values.


Results 


Table 1 contains the intercorrelations among the five scores on 
measures of behavioral cognition for the total sample. All of these 
10 coefficients are significant at the .01 level. This indicates that the
four behavioral cognition factors are not statistically independent 
of one another. The Social Translations Test (CBT), which is the 
only test using verbal material, has the lowest correlation with 
other measures of behavioral cognition. These high correlations can 
be interpreted as indicative of mutual involvement of abilities in 
solving the problems with behavioral content. In terms of the cri- 
teria specified by Campbell and Fiske (1959), one might say that 
as these measures are conceptually independent (Guilford, 1964), 
the high correlations are indicative of inadequate discriminant va- 
lidity of these measures. However, in view of the fact that their 
construct validity has already been established by O'Sullivan et al., 
(1965), these correlations may be interpreted as reflecting the at- 


TABLE 1 


Intercorrelations among the Five Behavioral Cognition Scores 
for the Total Sample 


Tests 2 3 27-4129 M SD 


l. Missing Cartoons (CBS) 457" 841* 550* 770* 19.405 4.736 


i | 3.846 
Expression Grouping (CBC) SUN Sce Да Бич 

a aa Translations (CBT) ыз pu a we 1 p 

i ы (CBI) 77.710 13.155 


Note.—* Significant at p = .01 level, nondirectional test. Decimal points are omitted
in correlations only.


5 Intercorrelations were also computed for the male and the female samples
separately. Though there was a sex difference in the magnitude of r's, there
was no evidence for a sex difference in the patterning of the r's.
Correlating the matched r's across sex had a Spearman rho of .40 which is
statistically significant (p < .05).

tributes of the testees and not as indicators of only the formal
properties of the tests, as implied by the concept of discriminant 
validity (Kroger, 1968). 

Table 2 shows the relationship of measures of behavioral cog- 
nition with those of interpersonal perception. The Cartoon Predic- 
tions Test does not have a significant correlation with Opinion 
Prediction and Behavior Postdiction measures. However, its cor- 
relation with Adjective Checklist F (ACLF) is statistically sig- 
nificant, but that is also true for the correlations of other behavioral 
cognition measures with ACLF. When the correlations of behavioral 
cognition scores and ACLF were obtained separately for the male 
and female sample, they were significant only in the male sample. 

It may be recalled that the SI product category of implications is 
concerned with extrapolations from given information to either its
antecedents or its consequents. Cartoon Predictions, the measure of 
CBI, is based mainly on the consequent part of the definition. The 
measures of interpersonal perception are also concerned with the 
extrapolations from given information. The main difference between 


TABLE 2 


Correlations of Behavioral Cognition Scores and Interpersonal Perception Scores 
for the Total Sample (N = 200) 


Behavioral Cognition 


3. 2. 3. 4. 
Eo оса, po ma eivai ч 
(CBS) (CBC) (CBT) (CBI) Com 
26. Opinion © 
A P ua M —10  —-03 —058 —065 | 
c M e M 136 109 117 111 134 
Postdiction M — £ = 
T E T шы ш ооп сю 
á Ade F —020 014 027 009 
о. o 
í 
| 


* Significant at p = .05 level, nondirectional test.
b Significant at p = .01 level, nondirectional test.
Note.—Decimal points have been omitted.

these two sets of instruments is that of method: filmed interviews
versus immobile photographs and cartoons. Since convergent validity
is represented in the agreement between two attempts to
measure the same trait through maximally different methods, it
may be said that the data in the present study failed to provide 
enough evidence for convergent validity of Cartoon Predictions 
Test. It is possible that when immobile photographs, cartoons and 
drawings of faces are used, one may be measuring a variable which 
is different from what is being measured by using a filmed inter- 
view which provides a number of other cues e.g., tone, interaction 
with other persons, behavior in different situations. Other possible 
reasons for these low relationships include (a) low reliability of 
OPT, ACL and BPT which were used to assess accuracy of 
interpersonal prediction, (b) low reliability of Cartoon Predictions 
Tests which was used as a measure of CBI, and (c) the restricted 
range of scores on each of the measures employed in the present 
study. 

Table 3 shows the relationship of behavioral cognition scores with 
personality traits. Missing Cartoons, the CBS measure used in 
this study, has a statistically significant correlation with Sensing- 
Intuition and Judging-Perceiving Scales of the MBTI. This indi- 
cates that those who obtain high scores on CBS also tend to do well 
on (a) sensing (being aware of things directly through one or 
another of five senses) as opposed to intuition and (b) judging 
(coming to a conclusion by shutting off perception for the time 
being) as opposed to perception. The Expression Grouping (CBC) 
has a statistically significant correlation only with the Extraversion-
Introversion Scale of MBTI; this shows that extraverted Ss
tend to do well on CBC. Social Translations (CBT) is not cor- 
related with any of the personality variables, and Cartoon Pre- 
dictions (CBI) is correlated only with Sensing-Intuition. 

Table 4 contains the correlations of behavioral cognition scores 
with interests and values. Both Missing Cartoons (CBS) and Car- 
toon Predictions (CBI), the measures using cartoons, correlate 
significantly with artistic interests. This indicates that Ss who enjoy
artistic activities tend to obtain high scores on measures of CBS 
and CBI, which in the present study use cartoons. Further stud- 
ies are needed to see if the artistic interests continue to be correlated 
with CBS and CBI even when the measures do not use cartoons 


TABLE 3


Correlations between Behavioral Cognition Scores and Scores on Personality Variables 


Personality 
Variable 


Extraversion- 
Introversion 


. Sensing-Intuition 

. Thinking-Feeling 

. Judging-Perceiving 
. Expressed Inclusion 


for the Total Sample (N = 200) 


Behavioral Cognition 
1. 2. 3. 4. 
Missing Expression Social Cartoon 
Cartoons  Grouping Translation Predictions 

(CBS) (CBC) (CBT) (CBI) 

107 155 096 040 

2555 041 124 168* 

027 —031 081 098 

1895 033 107 054 
—015 —020 057 110 
—116 011 047 013 
—070 018 111 127 
—020 020 086 077 

001 —042 —115 013 
—101 —010 —058 008 


* Significant at p = .05 level, nondirectional test.
b Significant at p = .01 level, nondirectional test.
Note.—Decimal points have been omitted.


and drawings. Negative correlations of CBT and CBI with persuasive
interests show that those who enjoy persuasive activities
may not do well on tasks involving behavioral transformation or
extrapolation. When the data were analyzed separately for the male


TABLE 4 


Correlations of Behavioral Cognition Scores with Scores on Measures of Interests 


Interests and 
Values 


Persuasive Interests 
Artistic Interests 
Literary Interests 
Social Service 
Interests 
Economie Values 


. Aesthetic Values 


Social Values 


. Political Values 


and Values for the Total Sample (N = 200) 


Behavioral Cognition 


М 2. 3. 4. 
Missing Expression Social Cartoon 
Cartoons  Grouping Translation Predictions 
(CBS) (CBC) (CBT) (CBI) 

—095 -111 —202b —163* 
3185 084 090 193^ 
077 045 034 —011 

—115 042 & 121 —033 

—052 —061 —151 —100 
125 —021 054 024. 

—157* —065 059 —077 

—010 041 —095 020 


Note.—Decimal points have been omitted, 
* Significant at р = .05 level, nondirectional test. 
b Significant at р = .01 level, nondirectional test. 
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and female sample, some sex differences were found in the pattern 
of relationships. 

Multiple regression analysis was performed for each dependent
variable by using the stepwise procedure. Table 5 shows the
independent variables with significant partial F values. Although the
multiple correlations for the four dependent variables were sta- 
tistically significant at the .01 level, different variables accounted 
for the prediction in every regression equation. Except in the case 
of Missing Cartoons (CBS), a different set of predictors was 
utilized in the prediction of the dependent variable in the male and 
female samples. In the female sample Extraversion-Introversion 
had a significant partial F value in predicting the performance on 
CBS, CBC, and CBT measures, while in the male sample it was 
significant only in the prediction of CBS scores. 


Summary 

This study was designed to determine the relationship of be- 
havioral cognition with interpersonal perception and personality 
traits. Though cognition of behavioral implications is by definition 
very similar to interpersonal prediction, the present study did not 
find significant correlations between them. As two different meth- 
ods were used to measure these variables, this was considered an 
indication of inadequate convergent validity. The pattern of
relationship of behavioral cognition factors with the personality
variables was different in the male and female samples. In the total
sample CBS had statistically significant correlations with the
Sensing-Intuition and Judging-Perceiving scales of MBTI, CBC with
Extraversion-Introversion, and CBI with Sensing-Intuition. Both
Missing Cartoons (CBS) and Cartoon Predictions (CBI), the
measures using cartoons, correlated significantly with artistic
interests. Significant multiple correlations were obtained for each of
the dependent variables.


VOCATIONAL INTERESTS AND INTELLIGENCE IN
GIFTED ADOLESCENTS


GEORGE S. WELSH
The University of North Carolina at Chapel Hill

IN a previous study of gifted adolescents it was shown that verbal
interests and differential performance on a verbal and a nonverbal
intelligence test were significantly related (Welsh, 1967). Verbal
interest was inferred from scores on three scales of the Strong
Vocational Interest Blank (SVIB) (Strong, 1959): Advertising Man,
Lawyer, and Author-Journalist. Verbal intelligence was measured
by Terman's Concept Mastery Test (CMT) (Terman, 1956) and
nonverbal intelligence by the D-48 (Black, 1963). Subjects with
greater verbal interest scored relatively higher in verbal intelligence
and, conversely, subjects higher in verbal intelligence showed higher
scores on verbal interest scales.

The present report is a further study of the same subjects using
a correlational analysis of SVIB scales with the CMT and the
D-48. In addition to CMT total-score, separate part-scores for the
two sections, Vocabulary (Synonyms-Antonyms) and Analogies,
were also utilized since it has been found that for these subjects
Analogies is more highly correlated with the D-48 than is
Vocabulary (Welsh, 1969).

From the SVIB 55 regular vocational and nonvocational scales
were examined plus four newly developed scales. These special
scales resulted from an item analysis of subgroups of the gifted
adolescents scoring relatively high or low on a figure preference test
art scale, often used as an index of creative potential (Welsh,
1959), conjointly with high or low scores on the CMT.

These scales have been conceptualized along two independent
dimensions. The first dimension, now called "origence," contrasts
those at the low end who prefer an organized, well-structured,
obvious, and explicit situation versus the high origent person who is
more at home in an open, diffuse, subtle, and implicit task. The
second dimension, “intellectence,” contrasts those at the low end
who seem to favor concrete and literal experience versus the
abstract-conceptual approach of the high intellectent person. These
dimensions have been developed in a two-factor model of
creativity (Welsh, in press).

The following nomenclature has been adopted for the four special
SVIB scales:¹

    S-1                     S-2
    High origence           High origence
    Low intellectence       High intellectence

    S-3                     S-4
    Low origence            Low origence
    Low intellectence       High intellectence

Table 1 gives the correlations of all 59 SVIB scales with
intelligence test scores for the sexes separately.


TABLE 1

Correlations of Strong Scales with Intelligence Test Scores

                       Male                        Female
                Terman CMT                   Terman CMT
             Total Vocab Anal  D-48       Total Vocab Anal  D-48

Artist 17 16 16 03 20 п ж 
Psychologist 47 43 46 24 И 38 38 2 
Architect jo Мда „лб: от 25 25 22 ( 
ysician E xu ой 31 29 29, Ж 
Psychiatrist U Ue 13:30 Tê 32 30 30 1 
D -02  -04 02 —01 06 06 05, a 
va 02.03... 08. 02 06 06 eS 
rinarian —25 -——9? ——17 L068 —10 -i6 Ж ЫШЫ 
п - 
Mathematician 39 31 44 32 6 30 13 
Physicist 29 21 35 27 in 22 26 : 
Chemist 31 23 SB О] 31 27 31 
Engineer 16 09 24 21 21 17 23 2 


¹ These special scales may be scored on the current form of the SVIB,
399 T, by arrangement with Prediction, Inc., Box 298, Greensboro, N.C.
27402; or write directly to National Computer Systems, 1015 South
Sixth St., Minneapolis, Minn. 55415.

TABLE 1 (Continued)
Male Female 


Terman CMT Terman CMT 


Total Vocab Anal D-48 Total Vocab Anal 0-48 


E MES LLL 18." CT COH 
ЕЕ и 0 0 07
-18 o =ОШ MOS ile cu O
ЖШШЕ ИЕ 30 -2 -20 -19 —03 
08 06 10 п 22 TEE 
—18 =16, —18 —09 -0 -o9 -03 0 
03 o5 -01 —03 05 05 o 05 
21 19 20 08 19 16 20 14 
01 02  -02 -01 o  -0l o 07 
=)3 oo 00 05 o; 08 08 Ч 
12 15 06 03 qo NH 
-04 -oi  -08 —09 PUE DE SUL O 
T 17-10 
-08 -06 -09 -03 -03 06 9
13 14 09 00 12 12
32 S 98 13 Le. 18
TABLE 1 (Continued)


Male Female 
Terman CMT Terman CMT 
Total Vocab Anal D48 Total Vocab Anal D4 
Purchasing Agent -32 -29 -30 -09 -3 -29 -V -4 
Banker -30 -235 -31 -м -37 -37 -9» - 
Pharmacist -и -33 -31 -13 =25 -2%  -19 8 
Mortician -5 -4 -52 -2 -42  -39  -38 -i 
IX 
Bales Manager -235 -17 -31 -15 -28 -uU -% - 
Real Estato 
Salesman -29 -2 -36 -23 -25 -2 -2 3 
Life Insurance 
Salesman -7  -19 -33 -u —90 -25 -3 9 
x 
Advertising Man 0t 10 -04 -12 05 o -Q -i 
Lawyer gea % 21 0 13 16 ШЕ. 
22 22 17 -02 15 19 0$ =1 
XI 
President, Mig. 
Concern ® в -u -0 -з -2 - - 
Non-Oceup. 
Level 1 
43 41 305. 18 30 28 29 
T п 8 05 -o i: o fom 
EE? s o n н "A 
o —-03 08 13 12 09 n, ! 
New 
E -4 -35 -49 -29 iu -u و ت‎ 
83 42 42 36 09 т 44 37 3 
84 =% -35 -33 -15 =39 -33 -%4 | 
43 34 48 41 31 24 36 
Note Decimal points preceding entries have been omitted. Size of groups and significance levels forr ste 
Male Female 
Rass —_ 
CMT р-48 CMT D48 
Rk 00 1 в шш 
nr р 05 085 s 006 
OL -112 Ern He 126 


It was anticipated from the previous study that interests would
be more highly correlated with the CMT than with the D-48.
This is well confirmed by the number of coefficients of correlation
reaching the .05 level. This is true for both males and
females. For the sexes combined, 89 or 76 per cent of the
total-CMT correlations are significant, while only 52 per cent of the
D-48 r's reach that level. The frequencies of significant and
insignificant correlations yield a chi-square of 30.25 with p < .0005.
The total numbers of significant and insignificant coefficients are
summarized in Table 2.

It may be seen that females show relatively more significant 
positive correlations for both tests and particularly for the D-48. 
Chi-square analysis, however, indicates that this trend is
insignificant.

TABLE 2

Summary of SVIB Scale Correlations with Total CMT and D-48 Scores

                         CMT                     D-48
Correlation       Male  Female  Total     Male  Female  Total

Significant a
  Positive         22     29      51       16     25      41
  Negative         20     18      38       11      9      20
Insignificant      17     12      29       33     25   [illegible]

a At or beyond .05 level.


Examination of the verbal-linguistic scales of Group X employed
in the previous study shows that although Lawyer and
Author-Journalist are significantly correlated with total CMT scores,
Advertising Man is not. This latter scale is, however, significantly
negatively correlated with the D-48. When differences in magnitude
and in direction of the CMT and the D-48 correlations are
considered together, the pattern of rs for all three scales is
consistent in showing an association of verbal interests with verbal
intelligence. Using a different method of analysis, then, some support
was obtained for the previous report of systematic relationship between
verbal interests and verbal intelligence scores.

As mentioned above, the correlation of the D-48 and CMT part-scores
was higher for Analogies than for Vocabulary; the actual
correlations are .49 and .33 respectively. It would follow that SVIB
scales more highly correlated with the D-48 than the CMT should
have a pattern of correlation with CMT part-scores in which the
coefficient for Analogies is higher than that for Vocabulary.
Interest scales more highly correlated with the CMT should show a
pattern of Vocabulary greater than Analogies. An analysis of these
correlational patterns is given in Table 3.


TABLE 3

Summary of CMT Vocabulary-Analogies Correlational Pattern for
Intelligence Tests and Interest Scales

                               Vocab ≥ Anal              Anal > Vocab
Correlation                 Male Female Subtotal     Male Female Subtotal   Totals

Significant on CMT only
  Positive a                  9     7     16           1     2      3
  Negative                    6    10     16           4     0      4
  Subtotal                                32                        7         39

Significant on D-48 only
  Positive                    0     1      1           4     4      8
  Negative                    0     0      0           1     1      2
  Subtotal                                 1                       10         11

Significant on CMT and D-48
  Positive                    2     7      9          10    13     23
  Negative                    3     4      7           7     4     11
  Subtotal                                16                       34         50

Insignificant                 8     4a    12           4     2      6         18

Totals                                    61                       57        118

a Includes one case each with r positive on one test, negative on other.


There were 39 interest scales significantly correlated with the
CMT but not with the D-48; of these, 32 showed a pattern of
Vocabulary equal to or greater than Analogies while only seven showed
the reverse relationship. Chi-square analysis gives a value of 16.00
with p < .0005.

Of the 11 interest scales correlated with the D-48 but not the
CMT, 10 showed a pattern of Analogies greater than Vocabulary.
The chi-square of 9.32 has p < .005.

It may be noted that scales significantly correlated with both of
the intelligence tests show a similar trend. Thirty-four of the 50
correlations had Analogies greater than Vocabulary, yielding a
chi-square of 6.50, p < .02. There was a slight trend in the opposite
direction for the insignificant correlations, with 12 out of 18 higher
for Vocabulary, but the chi-square of 2.00 does not reach the .05
level.

Finally, a general index of the differential association of interest
scales with the two intelligence tests and with the two sections of the
CMT was made by correlating the columns of correlations in Table 1
using Pearson r. These coefficients are given in Table 4.
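
The chi-square values quoted above for the Vocabulary-Analogies patterns can be approximated with a one-sample test of each pair of counts against an even split. This is a sketch under that assumption; the article does not state which chi-square formula was used, and the function name is hypothetical. Three of the four reported values (16.00, 6.50, 2.00) are reproduced to within rounding, while the value reported for the D-48-only scales (9.32) is not, so the exact procedure remains uncertain.

```python
def even_split_chi_square(a, b):
    """One-sample chi-square (1 df) testing counts a and b against a 50/50 split.

    With expected count (a + b) / 2 in each cell, the usual sum of
    (observed - expected)**2 / expected simplifies to (a - b)**2 / (a + b).
    """
    return (a - b) ** 2 / (a + b)


if __name__ == "__main__":
    # (Vocabulary >= Analogies, Analogies > Vocabulary) counts from the text.
    patterns = [
        ("CMT only", (32, 7)),
        ("CMT and D-48", (16, 34)),
        ("insignificant", (12, 6)),
    ]
    for label, counts in patterns:
        print(label, round(even_split_chi_square(*counts), 2))
```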

Both sexes show the D-48 to be more highly correlated with
Analogies than with Vocabulary. For males the values are .90 and
.73; for the females they are .86 and .71. Even cross-sex correlations
show the same kind of pattern. Male D-48 correlates .90 with
female Analogies but only .80 with Vocabulary. Counterpart
correlations for females are lower in absolute value but show the same
pattern, .76 with Analogies and .58 with Vocabulary.

It should be pointed out that the sexes are quite similar in the
association of interest scales and intelligence scores. Intercolumnar
correlations for males and females are: total CMT, .94; Vocabulary,
.92; Analogies, .95; and D-48, .89.
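
The columnar correlations described above treat each column of Table 1 as a variable whose observations are the interest scales. A minimal sketch of that computation follows; the two short columns are illustrative values only, not entries from Table 1, and the helper name is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical columns: one entry per interest scale, giving that
    # scale's correlation with Analogies and with the D-48.
    analogies = [0.40, 0.30, 0.10, -0.20, -0.35]
    d48 = [0.30, 0.25, 0.05, -0.10, -0.25]
    print(round(pearson_r(analogies, d48), 2))
```

A high value here means that scales correlating strongly with one test also tend to correlate strongly with the other, which is how the .90 and .73 figures are read.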

Some insight into the nature of interests associated with non- 
verbal intelligence may be gained by examining the correlations in 
Table 1. For both sexes the highest correlation for the D-48 is with 
the special SVIB scale S-4 measuring low origence/high intel- 
lectence. S-4 has been found to identify persons who are efficient, 
logical, and methodical; they prefer difficult tasks that can be 
solved by systematic application of rational procedures derived 
from conceptions and abstractions as well as by following rules 
and regulations; in temperament they are introversive though not 
necessarily asocial (Welsh, in press). 


TABLE 4 


Intercorrelations of Columns of Correlation Coefficients for Interest Scales 
with Intelligence Test Measures 


Columnar Correlation 


Test Мае e @ 6 9 0 mm 0) 
92 67 
(1) Total CMT do Л0Д O ga ro4" 102 
(2) Vocabulary 94 73 91 n a p 
(3) Analogies 90 95 
(4)  D-48 85 80 90 8 
Female 
(5) Total CMT 99 t ds 
(6) Vocabulary a) 
(7) Analogies 
(8 р-48 


Note. Decimal points preceding entries have been omitted.

Other leading correlations for males are, in order of magnitude:
Mathematician, Physicist, Chemist, Psychologist, Engineer, Phy- 
sician, Math-Science Teacher, and Senior CPA. For females the 
highest correlations are: Senior CPA, Chemist, Math-Science 
Teacher, Psychologist, Physician, Army Officer, Psychiatrist, Math- 
ematician, Physicist, and Engineer. Although there are some 
interesting sex differences to be noted, in general nonverbal intel- 
ligence is associated with scientific interests particularly the physi- 
cal sciences of Group II and some of the biological sciences of 
Group I. For females, the sole scale of Group VII, Senior CPA, 
also shows a marked associated interest. 

Highest negative correlations with nonverbal intelligence occur 
for the special SVIB scale S-1, high origence/low intellectence; 
again this is true for both sexes. Persons high on S-1 tend to be 
extraversive in temperament and describe themselves by traits such 
as adventurous, easy-going, pleasure-seeking, and talkative. They 
prefer social situations that are not demanding intellectually al- 
though they enjoy pitting their wits against others. 

Regular SVIB scales showing high negative correlations for males 
include: Life Insurance Salesman, Real Estate Salesman, Morti- 
cian, and Sales Manager. For females the order is: Real Estate 
Salesman, Life Insurance Salesman, Banker, Mortician, and Sales 
Manager. This cluster of negative rs includes all three sales or
business contact scales of Group IX as well as some of the busi- 
ness detail scales of Group VIII. 

Vocational interests associated with verbal intelligence may be
seen in particular by the correlations of the Vocabulary section of
the CMT. Highest positive correlations for males are: Psychologist,
S-2, Specialization Level, and Psychiatrist. For females they are:
S-2, Psychologist, Psychiatrist, and Physician. S-2, the special scale
for high origence/high intellectence, has been associated with
creativity in adults and rated originality in adolescents. A study of
poets showed them to score in this section of the two-dimensional
model mentioned above (King, 1969).

Persons high on S-2 describe themselves as complicated, [...]
unconventional. They are introversive [...] and are aloof and
self-centered. Tasks which are unstructured seem to challenge them and
they [...] imaginative solutions to problems. In regular Strong scales
verbal intelligence seems to be related especially to the independent
professions and the biological sciences of Group I.

Highest negative correlations appear in the male column for
Mortician, S-3, S-1, Pharmacist, Purchasing Agent, Veterinarian,
and Policeman. For females the order is: Mortician, S-3, Banker,
S-1, Purchasing Agent, and Pharmacist. Most of these regular
SVIB scales fall in Group VIII, business detail.

In addition to S-1, which was previously discussed, another special
scale, S-3, low origence/low intellectence, also shows a significant
negative relation. S-3 subjects see themselves as appreciative,
energetic, friendly, and practical. They are extraversive and seem
to enjoy working with people in a direct relationship. Routine tasks
related to tangible matters engage their interest and they prefer
a regular, orderly, and systematic approach to problems.

The relationship of vocational interests to intelligence is obviously
complex, but evidence from the performance of gifted adolescents
suggests several associated trends.

There seems to be a positive relationship between nonverbal
intelligence scores and scientific interests in the physical sciences and
other vocations stressing methodical and rational approaches to
their problems. Business interests, particularly in sales occupations
and in vocations requiring social and personal contact with people,
show negative relationship with nonverbal intelligence scores.

The previously reported association of verbal-linguistic interests
and verbal intelligence was less clearly demonstrated by the
correlational approach of the present study although a tendency
toward this kind of a relationship was found in the patterns of
different intelligence scores. Interests in the professional, biological
sciences show a positive relationship to verbal intelligence scores,
while interests in business detail and other occupations
characterized by routine and systematic procedures seem to be negatively
related.

Some evidence was found to justify using part-scores on the
CMT² since the kind of intellectual ability and interest required
for good performance on Analogies seems to be related to D-48
scores.
Finally, some of these observations may be arrayed on a
two-dimensional conceptual model that ties together personality traits,
vocational interests, and intellectual performance.

² The CMT Manual gives only total-score norms; however, complete
norms based on the gifted adolescents of the present study are
available for the part-scores as well as CMT total and the D-48
(Welsh, 1969).


MEASURES OF EGO IDENTITY: A MULTITRAIT
MULTIMETHOD VALIDATION¹


FRANK BAKER
Harvard Medical School

ERIKSON examines the growth of the personality in terms of a
series of eight successive stages, “predetermined in the human
organism's readiness to be aware of, and to interact with a widening
social radius” (Erikson, 1959, p. 52). As the developing
individual encounters different aspects of the social environment, each
step becomes a potential crisis because of the attendant radical
change in perspective. The problems of each stage can be solved in
one of two polar directions which lead to the development of a
series of alternative basic senses or attitudes: (a) trust versus
mistrust, (b) autonomy versus shame and doubt, (c) initiative versus
guilt, (d) industry versus inferiority, (e) identity versus identity
diffusion, (f) intimacy versus isolation, (g) generativity versus
self-absorption, and (h) integrity versus disgust and despair. The
psychosocial quality of each basic attitude becomes more
differentiated, as the ego comes into the possession of a more intensive
apparatus, even as society challenges and guides such extensions.

Adolescence is the stage at which Erikson postulates a crisis of
identity. The term “ego identity” is used by Erikson to denote
certain comprehensive gains which the individual, by the end of
adolescence, must have derived from all pre-adult experience in order to
be ready for the tasks of adulthood. The alternative to the
establishment of a sense of identity is the development of a sense of
identity diffusion. While identity diffusion is temporarily unavoidable
in the adolescent period of physical and psychological upheaval, the
danger is that there may result a permanent inability to “take hold.”
While Erikson asserts that “a sense of identity . . . can be defined,
and evidence from the presence of a dominant attitude of this kind
can be described behavioristically” (Erikson, 1950, p. 63), until
recently there has been little empirical investigation of the sense of
ego identity. Perhaps the main explanation is to be found in the
lack of clarity in Erikson's own conceptualization of the term. He
chooses to use his term identity with a number of connotations.

The present study attempts the difficult task of translating an
abstract, global, imprecisely defined concept into concrete
operational terms. In an attempt to identify the major components of
identity, Erikson’s clinical description of identity diffusion and
his discussion of the healthy personality were closely examined.

¹ This article is based on data collected for a [...] completed [...]
the degree of Doctor of Philosophy at [...].

In one of his earliest writings on the topic, Erikson describes t 
central loss of a feeling of identity among young war veterans: 


What had broken down . . . was a sense of identity, a sense of
who one is, of knowing where one belongs, of knowing what one
wants to do. I must emphasize from the outset that this kind of
identity is not the same as that which is called ‘personal
identity.’ It is not an amnesia of one’s name or history. It is rather
a breakdown of the sense that there is continuity and sameness
and meaning to one’s life history (Erikson, 1950, p. 16).


In a later essay, Erikson (1959) defines the sense of ego identity
as follows:


The sense of ego identity, then, is the accrued confidence that
one’s ability to maintain inner sameness and continuity is
matched by the sameness and continuity of one’s meaning for
others (Erikson, 1959, p. 89).


He goes on to describe its concomitants as “a sense of ‘knowing
where one is going,’ and an inner assuredness of anticipated
recognition from those who count” (Erikson, 1959, p. 118).

Four aspects of the sense of ego identity were derived from these
and other such statements. In contrast with individuals with a
diffuse sense of identity, an individual with a well developed sense of
identity: (a) knows who he is, (b) knows where he is going, (c)
perceives himself as having “inner sameness and continuity,” and
(d) is certain about the way his perception of himself compares
to the perceptions which others have of him.


Method

Construction of Scales


For each of the four aspects of identity defined above, an
eight-item Likert-type scale was designed. Each of the thirty-two
declarative statements was written following the assumption that if a
person strongly agrees with such statements, it would indicate one
extreme of the particular characteristic being tapped, and if he
strongly disagrees, it would indicate that he possesses the opposite
extreme. In order to control for response set, each scale was
composed of an equal number of positively worded and negatively
worded items. Illustrative examples
of the items for each scale are presented below:

Knows who he is

It isn't necessary to be a chameleon and be all things to all
people in order to get ahead in life.

What a bore it is, waking up in the morning always the same
person.

Knows where he is going

The major decisions a person makes are guided by the plans he
has for the future.

Life is chaotic, without direction or meaning.

Perceives himself as having “inner sameness and continuity”

A person doesn't change much once he has started out in the
world.

No one is the same person from day to day.

Is certain about the way his perception of himself compares to the
perceptions which others have of him

A person can be confident of getting recognition from those who
count.

What really matters is what other people think; it is not enough
just to be sure of oneself.

Five levels of response were provided for all of the items:
“strongly agree,” “agree,” “undecided,” “disagree,” and “strongly
disagree.” For positively worded items, “strongly agree” was
scored 5, “agree,” 4, “undecided,” 3, “disagree,” 2, and “strongly
disagree,” 1. Scoring of negative items was in the reverse direction.
Scores on each variable were arrived at by adding up the reversed
scores for the particular items composing each of the eight scales.
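
The scoring rule just described can be sketched as follows. The response weights come from the text; the function names, the data layout, and the ordering of positively and negatively worded items in the example are assumptions made for illustration.

```python
# Score assigned to each response on a positively worded item (from the text).
POSITIVE_KEY = {
    "strongly agree": 5,
    "agree": 4,
    "undecided": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def item_score(response, positively_worded):
    """Score one response; negatively worded items are reverse-keyed."""
    score = POSITIVE_KEY[response]
    return score if positively_worded else 6 - score


def scale_score(responses, keying):
    """Sum item scores for one scale.

    responses: list of response labels, one per item.
    keying: list of booleans, True where the item is positively worded.
    """
    return sum(item_score(r, k) for r, k in zip(responses, keying))


if __name__ == "__main__":
    # Hypothetical respondent on an eight-item scale with four positively
    # and four negatively worded items (the item order is an assumption).
    keying = [True, False] * 4
    responses = ["agree", "disagree"] * 4
    print(scale_score(responses, keying))  # every item scores 4, total 32
```

Reverse keying as `6 - score` maps 5 to 1, 4 to 2, and leaves "undecided" at 3, so a high total always points toward the positive pole of the characteristic.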


Sentence-Completion Instruments 


Eight sentence stems for each of the four aspects of identity were
written. The stems were specifically designed to elicit responses rele- 
vant to expressions of one of the four characteristics of identity. 

The thirty-two sentence stems were assembled in an alternation 
format and labeled “Incomplete Sentence Blanks.” This instrument 
was administered as part of the longer questionnaire containing 
the scales described above. Illustrative sentence completion stems 
for each of the four aspects are presented as follows: 


Knows who he is

When somebody confuses me with someone else, I ...
Pretending to be somebody you aren't is ...

Knows where he is going

The things I want out of life are ...
In making plans for the future, I ...

Perceives himself as having “inner sameness and continuity”

The person I was yesterday and the person I am today are ...
If it were possible to go back in time and see myself as I used to
be, I would probably feel that I ...

Is certain about the way his perception of himself compares to the
perceptions others have of him

I am sure that people think of me as ...
In a comparison of the way I see myself and the way others see
me, I think ...


A scoring-by-example manual was developed for scoring the
“Incomplete Sentence Blank” following much the same procedure
detailed by Renner, Maher, and Campbell (1962). On the basis of
preliminary administrations of the instrument, 185 protocols were
obtained and were used as a reservoir from which responses were
abstracted which seemed relevant to the descriptions of each of the
seven variables. Each response was assigned a weight of 5, 4, 3, 2, or
1, depending upon whether E judged the item to be: 5, strongly
indicative of the positive pole of the variable; 4, somewhat
indicative of the positive pole of the variable; 3, ambiguous or could not
be scored; 2, somewhat indicative of the negative pole of the
variable; 1, strongly indicative of the negative pole of the variable.
Each completion was judged only in terms of characteristics of the
variable its stem had been designed to elicit.

This pool of items was assembled into a scoring-by-example 
manual. The end product was a manual made up of illustrations of 
scorings of sentence completions for each incomplete sentence stem 
for the one variable the stem had been designed to tap. 

Two raters independently scored samples of 20 protocols drawn
from the pool of 705 questionnaires used in this study. To reduce
the “halo” effect as much as possible, each item was scored on all 20
questionnaires before the rater proceeded to score the next item.
The stack of questionnaires was shuffled between the scoring of
each item. After this procedure had been completed, inter-rater
reliability was computed for each of the variables by computing the
product-moment correlation for raw scores. The correlations
obtained were: Knows who he is, +0.95; Knows where he is going,
+0.96; “Inner sameness and continuity,” +0.96; and “Knows his
stimulus value,” +0.96.

In the major validity study, the description of which follows,
each of the 705 protocols was scored by one rater, with similar
precautions against a “halo” effect. The scores assigned to each
respondent on the sentence completion were the sums of the rater's scores
on those completions for the stems of each aspect of identity.


Subjects 


The instruments, previously described, were administered during 
Orientation Week as part of a longer questionnaire to 715 male 
freshmen entering Lehigh University. Seven-hundred and five ques- 
tionnaires were returned sufficiently completed to be used in this 
study and the analysis of data presented here is based upon this 
sample of 705 male college students.


Results and Discussion

Reliability

Reliability was estimated by computation of Kuder-Richardson 
(Formula 20) reliability coefficients. The reliabilities for the vari- 
ables under consideration as measured by the Likert-type scales 
and sentence completion methods are presented in Table 1. 
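
For reference, the textbook form of Kuder-Richardson Formula 20 is sketched below for dichotomous (0/1) items. The article does not say how the formula was applied to its five-point items, so this block illustrates the named coefficient only and is not a reconstruction of the author's computation; the function name and the small data set are hypothetical.

```python
from statistics import pvariance


def kr20(item_matrix):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    item_matrix: one row per person, one 0/1 entry per item.
    KR-20 = k / (k - 1) * (1 - sum(p_i * q_i) / variance of total scores),
    where p_i is the proportion scoring 1 on item i and q_i = 1 - p_i.
    """
    n_persons = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_matrix) / n_persons
        sum_pq += p * (1 - p)
    # Population variance of totals, matching the p * q item variances.
    return k / (k - 1) * (1 - sum_pq / pvariance(totals))


if __name__ == "__main__":
    # Hypothetical responses of four persons to three items.
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(kr20(data))  # 0.75
```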

These reliabilities appear to be quite low. In order to check
whether or not they are significantly different from zero, the
obtained reliabilities could be turned into an F ratio by the formula:
r_tt = 1 - 1/F. This F is for the main effect of persons tested against
the person-times-items interaction. The degrees of freedom are as
follows: (Persons - 1) and (Persons - 1)(Items - 1). Since we
have 705 persons and eight items, the degrees of freedom would
be 704 and 4,928. From an F table, we discover that the F required
for p < .01 with these df is approximately 1.13, which translates
into a reliability of .11 as the minimum p < .01 level. The more
persons, and the more items, the lower the reliability required for
significance. Applying this criterion, the obtained
Kuder-Richardson reliabilities are all significantly different from zero.
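
The arithmetic in this argument can be retraced with a short sketch. The relation between reliability and F, the sample sizes, and the critical F of 1.13 are taken from the text; the function names are hypothetical, and the critical value is used as given rather than computed.

```python
def reliability_to_f(r):
    """F ratio implied by a reliability r, from r = 1 - 1/F."""
    return 1 / (1 - r)


def f_to_reliability(f):
    """Reliability implied by an F ratio, from r = 1 - 1/F."""
    return 1 - 1 / f


def degrees_of_freedom(persons, items):
    """df for persons tested against the person-by-items interaction."""
    return persons - 1, (persons - 1) * (items - 1)


if __name__ == "__main__":
    print(degrees_of_freedom(705, 8))        # (704, 4928)
    print(round(f_to_reliability(1.13), 3))  # 0.115
```

The implied minimum of about .115 corresponds to the .11 quoted in the text.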

However, since the magnitudes of these reliability coefficients
compare unfavorably with those usually reported, one possible
explanation should be mentioned. An explanation lies in the nature
of the concepts themselves, the definition of which poses complex
multi-dimensional constructs. Since increasing complexity of a trait
increases the probability that individuals will exhibit unique
patternings of the component elements, it is generally true that the
more complex the measure, the lower the reliability estimate will
be. It is important to note that the Kuder-Richardson reliability
coefficient for the F scale, which was also administered to this
student group at the same time, is only .39 and, using that classic
instrument as a standard of comparison, these new measures compare
very favorably.


TABLE 1

Kuder-Richardson Reliability Coefficients for Scales and
Sentence-Completion Measures

                                                 Sentence-Completion
                                     Scales           Measures

Knows who he is                   [illegible]       [illegible]
Knows where he is going           [illegible]           .26
Inner sameness and continuity     [illegible]       [illegible]
Knows his stimulus value          [illegible]       [illegible]

It can be further argued that internal consistency is irrelevant 
to the testing of the hypotheses. These items represent an effort at 
translating some global, abstract concepts into concrete operational 
terms. The combination of the items is inevitably better at repre- 
senting the construct in question than any one of them would be. 
Although in each item some of the specific details introduce tan-
gential values, these irrelevancies differ from item to item, tending
to be overweighed, in the total score, by the common core. The item 
set, whether or not it were to turn out that the individual differences 
within a measure cohere as a psychological syndrome, would re- 
main, subjectively, the most accurate operational representation 
of the construct in question. 


Relations between Aspects of Identity 


Table 2 presents the correlation matrix which resulted from 
intercorrelating the total scores of the four aspects of identity, as 
measured by the specially constructed Likert-type scales described
above, with the total scores of sentence-completion measures of these
same variables. Such a matrix of intercorrelations resulting when
each of several traits or constructs is measured by each of several
methods has been referred to as a multitrait-multimethod ma-
trix (Campbell and Fiske, 1959).

Campbell and Fiske have noted that: “Insofar as the traits are 
expected to correlate with each other, the monomethod correlations 
will be substantial and heteromethod correlations between traits will 
also be positive” (Campbell and Fiske, 1959, p. 104). Reversing their 
emphasis on discrimination between measures of different traits, 
the concern here is with showing the convergence of related traits
(or aspects of a construct) across methods. If traits are predicted
to be closely interrelated, they should be significantly correlated
when measured by the same methods and when measured by differ-
ent methods.

Three of these four aspects of identity, as measured by the same
method and as measured by different methods, show significant
intercorrelations in all cases. In the single-method triangle represent-
ing the intercorrelations of the multiple variables measured by the
one method of Likert-type scales, “inner sameness and continuity”
is not significantly correlated with “knows who he is,” “knows where
he is going,” or “social stimulus value.” Its correlation with the first
two of these is near zero, and the correlation with the last is negative
and almost at the 5 per cent level of significance. In the sentence-
completion triangle it is significantly correlated with “knows who
he is” and “social stimulus value” but again is near zero in its cor-
relation with “knows where he is going.”


TABLE 2

Multitrait-Multimethod Matrix Based on Direct-Attitude and Sentence-Completion
Measures of Ego Identity

[Body of the matrix not recoverable from the source; only the column
heading “Direct-Attitude Method” is legible.]

df = 703; r = .07 is significant at 5% level; r = .10 at 1% level.
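
The significance thresholds quoted in the note to Table 2 can be approximated with a short Python sketch. It assumes a two-tailed test and substitutes the normal deviate for Student's t, which is a close approximation at df = 703; neither assumption is stated in the original, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def critical_r(df, alpha):
    """Approximate two-tailed critical value of r for the given df and alpha.

    Uses r = t / sqrt(t**2 + df), with the normal deviate standing in for t
    (an assumption that is reasonable only for large df).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z / sqrt(z * z + df)


r_05 = critical_r(703, 0.05)
r_01 = critical_r(703, 0.01)
print(round(r_05, 2), round(r_01, 2))
```

Under these assumptions the two values round to .07 and .10, in line with the table note.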

While all four of the defined characteristics of identity have 
monotrait-heteromethod values which are significantly different
from zero and hence evidence convergent validity, only “inner
sameness and continuity” has a validity diagonal value higher than 
values lying in its column and row in the multiple-methods block. 
This would indicate that “inner sameness and continuity” is more 
closely related to itself measured by different methods than to the 
other aspects of identity. 
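
The comparison just described, a validity diagonal value against the other entries in its row and column of the multiple-methods block, can be sketched as a simple check in Python. The function name and the small matrix below are hypothetical illustrations and are not the values of Table 2.

```python
def validity_diagonal_dominates(block, trait):
    """True if block[trait][trait] exceeds every other value in its row and column.

    `block` is the heteromethod block: block[i][j] is the correlation of
    trait i measured by one method with trait j measured by the other.
    """
    n = len(block)
    diagonal = block[trait][trait]
    row_others = [block[trait][j] for j in range(n) if j != trait]
    column_others = [block[i][trait] for i in range(n) if i != trait]
    return all(diagonal > value for value in row_others + column_others)


# Hypothetical three-trait heteromethod block, for illustration only.
example_block = [
    [0.30, 0.35, 0.05],
    [0.33, 0.28, 0.02],
    [0.04, 0.03, 0.25],
]

for trait_index in range(3):
    print(trait_index, validity_diagonal_dominates(example_block, trait_index))
```

In this made-up block only the third trait passes the check, which mirrors the pattern reported for “inner sameness and continuity.”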

Consistent with the preceding evidence of independence, “inner
sameness and continuity” alone meets the Campbell and Fiske
discriminant validity requirement of correlating higher with itself
than with other traits which happen to employ the same method. “Inner
sameness and continuity,” contrary to the theoretical prediction,
would appear to be a trait unrelated to the other three aspects of
identity.

The data support the prediction that the college student respon-
dents studied here know who they are, know where they are
going, and are aware of how their perception of themselves com-
pares to the perceptions others have of them. Apparently they do not
necessarily see themselves as having “inner sameness and con-
tinuity” as it has been empirically defined here. Conversely, respon-
dents who are not sure who they are, where they
are going, or what their “social stimulus value” is do not necessarily
lack a feeling of inner sameness and continuity.

Conclusions 

This study was successful in translating Erikson's concept of a
sense of ego identity into useful operational terms, and three of the
four aspects of the concept were found to be significantly intercor-
related. “Inner sameness and continuity” appears to be unrelated


ing since any study such as this is both a study of the instruments
and the theory on the basis of which the instruments are con- 
structed and their interrelationship predicted. “Inner sameness and 
continuity” as it was operationalized here also may contain an in- 
flexibility that Erikson had not intended in his use of the concept. 

The results of this study lend support to the concept of identity 
as a significant variable, descriptive of variations of self-attitude 
among late adolescents and related to the earlier development of a
sense of trust versus mistrust. Further research on the
validity of the combined three [text illegible] identity [text illegible]
here is called for in order to test the empirical usefulness of the
Erikson theoretical structure.


DIMENSIONS OF PSYCHOPATHOLOGY IN MIDDLE
CHILDHOOD AS EVALUATED BY THREE SYMPTOM
CHECKLISTS


ELISE E. LESSING AND SUSAN W. ZAGORIN¹


Institute for Juvenile Research 
Chicago, Illinois 


In spite of widespread agreement regarding the desirability of a 
theory-derived diagnostic classification system for the psychiatric 
disorders of childhood (Bard, Sidwell, and Wittenbrook, 1955; 
Rutter, 1965; Freud, 1965; Achenbach, 1966), no personality theory 
has obtained sufficient general support to form the basis for such a
system. Increasing numbers of investigators have, therefore, focused
their efforts upon developing a descriptive classification of manifest
symptoms defined in behavioral terms with a minimum of interpre- 
tation and inference. 

As Miller (1967a) has indicated, many variables contribute to 
the differences across studies in regard to the descriptive classifica- 
tion scheme derived. Mathematical considerations such as the type 
of factor analytic rotation employed and the number of factors 
extracted, subject population considerations such as the types of 
children evaluated, and item considerations such as number of items 
representing a given problem area can all influence what emerges 
as the major syndromes in a given study. Only the bipolar division
of symptoms into those involving primarily inner discomfort and
those involving negatively valued acts against persons or objects
in the environment has been demonstrated to have a generality
that transcends most variations in subjects, items, and procedures. 

Thus, Peterson's (1961) Personality Problem and Conduct Problem,
Achenbach’s (1966) first principal bipolar factor of Internalizing-
Externalizing, the second-order factors of Aggression and Inhibi-
tion identified by Miller (1967a), and the Rebelliousness and An-
xiety factors of Collins, Maxwell, and Cameron (1962) all reflect
the same basic dichotomy.

¹The authors would like to express their appreciation to [name illegible],
who provided statistical consultation and supervised the canonical
analysis, and to Merton Krause and Ferdinand van der Veen for helpful
comments regarding the manuscript.

Other factors, proposed as basic syndromes, have shown less sta- 
bility across studies. Learning Disability or School Failure has 
been identified by several investigators, each using a different check- 
list (Collins, et al., 1962; Brewer, 1961; Miller, 1967a). However, 
this factor has never emerged in the series of studies utilizing the 
Peterson Problem Checklist (Peterson, 1961; Quay, 1964; Quay and 
Quay, 1965). On the other hand, the Inadequacy-Immaturity factor 
derived from Peterson Problem Checklist data by Quay and his 
collaborators (Quay, 1964; Quay and Quay, 1965) was not identi- 
fied by Miller or Brewer, though their subjects were child guidance 
clinic patients among whom such a cluster of symptoms might be 
expected on a priori grounds. 

The practical problem of selecting a symptom checklist for pos- 
sible future routine use at the Institute made a salient issue of the 
extent to which differences in item content would affect one’s conclu- 
sions regarding the nature of the presenting psychopathology. Previ- 
ous studies provided considerable data regarding factor stability 
across samples with symptom items held constant (Quay and 
Quay, 1965; Quay, Morse, and Cutler, 1966), but very little data 
regarding factor comparability across item samples with subject 
sample held constant. Therefore, the present study was undertaken 
with the following specific purposes: (a) to compare the symptom 
factors obtained when three different sets of items were utilized for 
behavior ratings of the same sample of children, and (b) to provide
additional information regarding the major factors obtainable 
from the widely used Peterson Problem Checklist, with particular 
attention being given to the item content of the Inadequacy- 
Immaturity factor. 


Method 
Sample 


Subjects were 102 children, aged 10 years 0 months through 12 
years 11 months, who received a psychiatric examination at the
Institute for Juvenile Research from September, 1963 through June,
1965, after being referred to the Institute for child guidance services.
All admissions during this time period were included in the research 
sample except for children whose psychiatric interview occurred 
during vacations of the research personnel conducting the project. 
The 102 children constituting the sample for the present study were 
part of a group of 110 children who served as the clinic sample for a 
validation study of the IPAT Children's Personality Questionnaire 
(Lessing and Smouse, 1967). The original sample was limited to 
children with IQ scores of 70 or higher. The present sample was 
further limited to those children whose mothers evaluated them on 
both the Peterson Problem Checklist and the Wichita Guidance 
Center Checklist and whose examining psychiatrist coded the IJR 
Symptom Checklist on the basis of the mother’s oral report of the 
child’s symptoms. 


Instruments 


The Peterson Problem Checklist (Peterson, 1961) and the Wichita 
Guidance Center Checklist (Engel, 1955; Brewer, 1961) were selected 
for use since both are item samples from the universe of referral 
symptoms reported by parents of disturbed children brought to 
child guidance clinics. However, the two instruments had yielded 
different factor structures in previous studies. 

The Peterson Problem Checklist consists of the 58 most frequently 
reported symptoms (e.g. “Disobedience,” “Daydreaming”) among 
477 representatively chosen child guidance clinic cases. The moth- 
ers in the present study were merely asked to circle the number of 
each item which described their child. Although Peterson asked
mothers to judge whether the problem was “mild” or “severe” in
his original large-scale study, the differentiation of severity was 
dropped in the statistical analysis he performed (Peterson, 1961, 
p. 205), and thus did not influence the factor structure he obtained. 

The Wichita Guidance Center Checklist (Engel, 1955; Brewer,
1961), like the Peterson Problem Checklist (PPC), contains items
based on parental descriptions of the referral problems of pre-
adolescent child guidance patients. However, while the PPC items
were culled from a large sample of case records, the WGCC items
were based on detailed narrative descriptions composed by 25 moth-
ers, supplemented by additional items obtained from 25 other clinic
mothers. The 55 items included in the final checklist significantly
discriminated between a group of children of PTA mothers and a
group of children referred for child guidance clinic services. The
WGCC items are in sentence form, such as: “My child does not
seem to be learning like he (she) should” and “My child is a disci-
pline problem, at home and in school.” Brewer's (1967) factor anal-
ysis of ratings on this checklist made by the mothers of 200 boys,
aged 5-13 years, who were referred for child guidance clinic services,
yielded five factors: Conflict with Parents, Conflict with Teachers,
Failure in Peer Relations, Inner Tension, and School Failure.

The IJR Symptom Checklist (IJRC) is a list of 36 symptoms,
such as “Stealing” and “Shy, withdrawn, timid,” which were com-
piled by the Institute's staff psychiatrists. These items appear on 
case data summary cards (see Lessing and Schilling, 1966) and are 
routinely coded by the examining psychiatrist shortly after the 
diagnostic psychiatric interviews with the mother and child. Like 
the PPC and the WGCC, the IJRC codifies data regarding the 
referral symptoms of the child in dichotomous form. 


Procedure 


The mothers were asked to fill out the two symptom checklists 
while the child was being interviewed by the psychiatrists. In half 
of the sample, the PPC was administered first and in the other half 
the WGCC was given first. The second checklist was introduced 
with the explanation that the clinic was interested in getting as
complete a picture of the child’s symptoms as possible and the first 
list did not cover everything. The mothers were asked to check 
everything that applied on the second checklist even if it had already 
been checked on the first. The mother was then interviewed by the 
psychiatrist, who inquired about the child’s problems. The mother’s 
responses were then summarized by the psychiatrist in his coding 
of the IJR Symptom Checklist. 


Statistical Analysis 
All checklist items were coded as either 1 or 0. Then a product-
moment correlation matrix was computed among all items com-
prising each of the three checklists. Separate principal axis factor
analyses were performed upon each of the three correlation matrices. 
The squared multiple correlation of each variable with all other
variables was used as the communality estimate in the diagonals.
The decision regarding the number of factors to be rotated was guided
by two initial criteria: (1) retained factors should have an eigen-
value of at least 2.00, and (2) retained factors should account for at
least 5 per cent of the variance. However, three factor solutions
were tried for each checklist: a solution involving the number of
rotated factors dictated by the initial criteria, a solution involv-
ing one less than this number of factors, and a solution involving
one more than this number of factors. The final choice for each
checklist was that factor solution which produced the strongest,
most distinctive factors (i.e., those with many loadings above .40,
but whose defining items have high loadings only on the factor being
defined).
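
The two initial retention criteria and the three candidate solutions can be sketched in Python as follows. The sketch assumes that the variance base is the number of items (each item contributing unit variance in a correlation matrix), which the original does not state; the function names and the eigenvalues are hypothetical and serve only as an illustration.

```python
def count_retained_factors(eigenvalues, total_variance,
                           min_eigenvalue=2.0, min_share=0.05):
    """Number of factors meeting both criteria: eigenvalue and share of variance."""
    return sum(
        1
        for value in eigenvalues
        if value >= min_eigenvalue and value / total_variance >= min_share
    )


def candidate_solutions(n_retained):
    """The three solutions tried: one fewer, the indicated number, one more."""
    return [n_retained - 1, n_retained, n_retained + 1]


# Hypothetical eigenvalues for a 58-item checklist (illustration only).
example_eigenvalues = [9.1, 5.2, 3.4, 3.0, 2.1, 1.7]
n_items = 58  # assumed variance base: one unit of variance per item

n_retained = count_retained_factors(example_eigenvalues, n_items)
print(n_retained, candidate_solutions(n_retained))
```

With these made-up eigenvalues four factors satisfy both criteria, so the three-, four-, and five-factor solutions would be rotated and compared.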

The similarity between the PPC factors obtained in this study
and those obtained in previous studies was evaluated by means of
Tucker's coefficient of factor congruence (Harman, 1960, p. 257),
that is, the sum of the cross products of the
loadings of all items on the pair of rotated factors being compared
divided by the square root of the product of the two sums of squared fac-
tor loadings. The coefficient is interpretable as the equivalent of a
correlation coefficient.
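
A minimal Python sketch of the congruence coefficient follows. It assumes the usual form of Tucker's coefficient, in which the summed cross products of the loadings are divided by the square root of the product of the two sums of squared loadings; the function name and the loadings are hypothetical.

```python
from math import sqrt


def congruence_coefficient(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two columns of factor loadings."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("both factors must be defined on the same items")
    cross_products = sum(a * b for a, b in zip(loadings_a, loadings_b))
    sum_squares_a = sum(a * a for a in loadings_a)
    sum_squares_b = sum(b * b for b in loadings_b)
    return cross_products / sqrt(sum_squares_a * sum_squares_b)


# Hypothetical loadings of four items on a factor from each of two studies.
present_study = [0.70, 0.60, 0.10, -0.05]
previous_study = [0.65, 0.55, 0.20, 0.00]
print(round(congruence_coefficient(present_study, previous_study), 2))
```

The coefficient equals 1 when the two columns of loadings are proportional and 0 when their cross products sum to zero, which is why it can be read in the same way as a correlation.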


defined for each of the three checklists administered in the present 
study, factor scores on each of the rotated factors were first com-
puted by the method of ideal variables (Harman, 1960, pp. 360-
361), and a congruence matrix of factor scores was constructed. 
The elements of the matrix were coefficients of congruence (Harman, 
1960, p. 260) computed between every pair of factors. 


That is, for each of a given [text illegible]
new set of variates [text illegible]
way that the new variates [text illegible]
comparable variates on the [text illegible]
can be considered as factors resulting [text illegible] “[fac]-
toring” of two checklists. The variables [text illegible] in [this] case were
the binormamin-rotated factors obtained in initial independent
factoring of each checklist. A canonical variate pattern was ob-
tained for each of the two checklists by postmultiplying the matrix
of congruence coefficients among the initial factors by the matrix of 
canonical coefficients (the weights for combining the factors into 
canonical variates). This canonical variate pattern is analogous to, 
and can be interpreted as, a factor pattern. That is, it contains the 
loadings of the initial factors on the canonical variates. Since the 
canonical variates are orthogonal, these loadings are the correlations 
between the initial factors and the canonical variates. As part of 
the canonical analysis of each pair of checklists, the canonical cor- 
relations between the new variates were obtained. The magnitude 
of these correlations was then compared with that of the congru- 
ence coefficients between factor scores on the original, uncombined 
factors. Direct comparisons were possible because the coefficients of 
congruence were computed on deviation scores, with a mean of 
zero, so that they were in actuality correlation coefficients. 
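
The postmultiplication described above amounts to an ordinary matrix product of the congruence matrix among a checklist's initial factors and the matrix of canonical coefficients. A minimal Python sketch follows; the two-factor matrices are hypothetical numbers chosen for illustration, and the helper name is not from the original.

```python
def matrix_product(left, right):
    """Plain matrix product of two matrices given as lists of rows."""
    if len(left[0]) != len(right):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


# Hypothetical congruence matrix among two initial factors of one checklist.
factor_congruence = [
    [1.0, 0.2],
    [0.2, 1.0],
]
# Hypothetical canonical coefficients (weights) for one canonical variate.
canonical_weights = [
    [0.9],
    [0.3],
]

canonical_pattern = matrix_product(factor_congruence, canonical_weights)
print(canonical_pattern)
```

When the initial factors are uncorrelated the congruence matrix is an identity matrix, and the pattern simply reproduces the canonical weights.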


Results 
Factor Analyses of Three Checklists 


The 102 child guidance clinic children comprising the experimen- 
tal sample showed a high degree of heterogeneity in associations be- 
tween symptoms. In the case of each of the three symptom check- 
lists, the three or four factors rotated originally accounted for less 
than half of the total common variance. Rotation of many addi- 
tional, inefficient factors, accounting for small proportions of the 
total variance, would have been required to bring the cumulative 
variance accounted for up to a figure much over 50 per cent. 

In the case of the Peterson Problem Checklist (PPC), the four 
factors rotated accounted for 41 per cent of the common factor var-
iance. A five-factor solution would have met the two criteria that all
factors should have eigenvalues of at least 2.00 and account for at
least 5 per cent of the variance. However, this solution was rejected
because the fifth factor was a weak one with the highest factor load- 
ing being .52. Table 1 presents the rotated factor loadings for the 
four-factor solution along with loadings for the same variables from 
four other samples of comparable age who were administered the 
same items. Table 2 contains the values of Tucker’s coefficient of 
factor congruence (Harman, 1960, p. 257) for the PPC factors ob- 
tained in the present and the four previous studies.

Factor I in the present study has its highest loadings on “dis-
obedience,” “difficulty in disciplinary control,” “destructiveness,”
and “fighting.” Tucker coefficients of .84, .85, .85, and .88 were ob-
tained when this clear-cut Conduct Problem factor was compared 
with the Conduct Problem factor obtained in the earlier studies of 
fifth- and sixth-graders (Peterson, 1961), seventh-graders (Quay, 
1964), eighth-graders (Quay, 1964), and pre-adolescent delinquents 
(Quay, 1966). 

Factor II is apparently a combined Personality Problem and Aut-
ism factor. The highest loadings are on “preoccupation,” “lack of 
interest in the environment,” “excessive daydreaming,” “sluggish- 
ness,” and “lack of confidence.” As is indicated in Table 2, Tucker 
coefficients of .80, .88, .56, and .73 were obtained when this factor 
was compared with the Personality Problem factor obtained in 
previous studies. Tucker coefficients of .74 and .73 were obtained 
when this factor was compared with the Inadequate-Immature and/ 
or Autism factor of previous studies. 

Factor III of the current study, with its highest loadings on 
“self-consciousness” and “hypersensitive,” is apparently a Per- 
sonality Problem variant. Tucker coefficients of .60, .51, .81, and 
.74 were obtained when it was compared with Personality Problem
factors obtained in previous studies.


TABLE 2

Coefficients of Factor Congruence between PPC Factors Obtained in the Present Study
and Previous Studies of Children of Similar Age

[Body of table not recoverable from the source. Column groups: Peterson's
Fifth and Sixth Grades; Quay and Quay's Seventh Grade; Quay and Quay's
Eighth Grade; Quay's Pre-Adolescent Delinquent. Rows (present study):
I Conduct Problem; II Personality Problem-Autism; III Personality Problem;
IV Organic Problem.]

Note.—[text illegible] factors obtained in previous studies have [text illegible]:
I is [Conduct Pr]oblem or Psychopathic; II is Personality [Problem or]
Neurotic; and III is Inadequacy-Immaturity.


Factor IV, with its highest loadings on “restlessness,” “nausea,”
“tension,” “nervousness,” and “specific fears,” appears to be an
Organic-Somatic factor which has not been found in previous
studies utilizing the PPC. The Tucker coefficients of .56 and .65 
for the similarity between this factor and the Conduct Problem fac- 
tor found in Quay and Quay's (1965) samples of seventh- and 
eighth-graders are not particularly meaningful, since Quay and 
Quay excluded from their analyses the somatic symptoms having 
the highest loadings on Factor IV of the present study. Much
lower coefficients of factor similarity (.27 and .38) were obtained 
when Factor IV was compared with the Conduct Problem factors 
derived when the major portion of the PPC items were used as in 
the studies of fifth- and sixth-graders (Peterson, 1961) and pre- 
adolescent delinquents (Quay, 1964). 

In the case of the Wichita Guidance Center Checklist (WGCC),
four factors, accounting for 48 per cent of the common factor vari-
ance, were rotated. Though only three principal axis factors met
the twofold criterion of having eigenvalues of at least 2.00 and ac-
counting for at least 5 per cent of the total variance, the four-
factor solution was chosen after inspection of the rotated factor
patterns. The four-factor solution produced a strong fourth factor,
with several loadings above .50, without disturbing the three previ-
ously extracted factors. Table 3 presents the rotated factor loadings
for all items with loadings exceeding + or -.39 along with loadings
for the same items from Brewer's (1967) factor analysis of data
obtained from 200 child guidance clinic boys, aged 5-13 years.

The first factor obtained in the present study has highest load-
ings on items stating that the child “does not seem to be learning
like he should,” “makes only passing grades,” and “never finishes
assignments in school.” This cluster is readily identifiable as the
School Failure factor obtained by Brewer, and yielded a Tucker co-
efficient of .86 when compared statistically with the Brewer factor.
Factor II has highest loadings on items stating that the child
“often does things to attract attention even though he [is]
punished,” “is a discipline problem at home and in school,” “fre-
quently gets into things that he knows he shouldn't,” “is often hit-
ting and pushing other children,” and “is driving his teacher mad.”
This clear-cut Conduct Problem factor resembles the Conflict with
Teachers factor previously obtained by Brewer, and yielded a Tucker
coefficient of .78 when compared with the Brewer factor. Factor III
has its highest loadings on items stating that the child “never seems
happy,” “is constantly irritable with the children he plays with,”
“seems to hate everyone who comes near him,” and “is out of step
with the way of life in our home.” This Unhappiness-Unsociability
factor is fairly similar to the Failure in Peer Relations factor
identified by Brewer, and yielded a Tucker coefficient of .65 when
the two were compared. Factor IV has highest loadings on items
stating that the child “refuses to pick up clothes and toys around
the house,” and “refuses to do things to help around the house.”
This factor is moderately similar to the Conflict with Parents factor
obtained by Brewer, and yielded a Tucker coefficient of .56 when
compared with the Brewer factor. The congruence between the two
factors was lowered by the fact that several of the rebellious be-
havior items which loaded on Brewer's Conflict with Parents factor
also loaded on Factor II, Conduct Problem, in the present study.
The four-factor solution utilized in the present study did not yield
a factor comparable to Brewer's Inner Tension factor.


TABLE 3

Rotated Factor Loadings for WGCC Items with Loadings from a Previous
Study Presented for Comparison

                                                            Present Study              Brewer Study
Variables                                                   I   II  III   IV        I   II  III   IV    V

I. School Failure
37. Makes only passing grades                              68  -18  -23   00      -01   76  -09   10   ..
38. Never finishes school assignments                      67   26  -06  -15       09   82  -21   20   ..
5. Not learning like he should                             66  -04   06   08       23   74  -07   16   ..
27. Can't keep up with other children                      62  -21   24  -18       05   72   34  -05   ..
6. Cannot conform to school tasks                          59   28  -08  -06       23   62  -06   31   ..
39. Not ready to do the work expected                      56  -11   10  -03       39   60   03  -02   ..
40. Discouraged when having to do something on his own     52   06   19  -23       43   39   39   ..   ..
42. Will not respond in class                              52   25   03  -18      -02   66  -11   08   ..
15. Can't get interested in anything                       48  -19   ..   ..       08   44   ..   ..   ..
25. Very poor reader                                       48  -07   14  -17       05   74   ..  -07   ..
49. Short attention span                                   [loadings illegible]
1. Seldom finishes things                                  [loadings illegible]
3. Learning under force at home                            45   01   12   08       29   19   ..   ..   ..
20. Lacks self-confidence                                  45  -13   22  -05      -32   36   49  -15   ..
50. Can't do anything right                                [loadings illegible]
44. Has the ability, but won't use it                      [loadings illegible]
32. Daydreams a great deal                                 [loadings illegible]

II. Conduct Problem
52. Does things to attract attention                       [loadings illegible]
51. Frequently gets into things he should not             -03   64  -11   15       66  -06  -02   40   ..
11. Discipline problem (home and school)                   01   62   11   11       52  -16   13   ..   ..
33. Is driving teacher mad                                 [loadings illegible]
9. Hits other children                                    -03   57   06   ..       ..   ..   ..   ..   ..
41. Temper-tantrums                                       -06   53   30   02       56  -24   29   ..   ..
55. Restless in school                                     [loadings illegible]
24. Behavior is unpredictable                              00   48   24  -17       ..   ..   ..   ..   ..
[?]. Jumps or moves around all the time                    05   48   16  -12       12  -17   39   ..   ..
19. Rebellious and resentful                              -23   47   36   14       64  -20   26   30   42
[?]8. Has to have everything his way                      -08   ..   22   22       49  -15   ..   33   39
[?]8. Teases and torments other children                   06   40   07   05       09  -09   ..   49   58
[?]0. Driven to talk constantly                            ..   ..   ..  -06       ..  -31   37   53  -24

III. Unhappiness-Unsociability
4. Never seems happy                                       13   ..   62  -04       ..   ..   ..   ..   ..
[?]8. Seems to hate everybody                             -03   06   54  -08       40  -10   44  -15   ..
[?]. Out of step with way of life                          17  -08   52   25       29  -09   13   32   32
[?]. Constantly irritable with children                   -12   35   50  -09       12  -26   23   ..   ..
[?]. [illegible] all the time                              14  -08   46   19       ..   ..   ..   ..   ..
[?]. Can't make friends in school                          [loadings illegible]
31. Irritable at home                                     -14   16   41   19       ..   00   52   ..   ..

IV. Conflict with Parents
[?]. Refuses to pick up [text illegible]                   [loadings illegible]
[further rows illegible]
[?]. [text illegible] with father                          ..  -02  -12   28       35   ..   ..   ..   ..

Note.—In order to conserve space, all decimals have been omitted and all WGCC items have been abbre-
viated. [Entries shown as .. are illegible in the source.]

In the case of the IJR Checklist (IJRC), the three factors ro- 
tated accounted for 43 per cent of the common factor variance. 
Table 4 presents the rotated factor loadings for all items with load- 
ings exceeding + or —.35. Factor I, with its highest loadings on 
“excessive demands for attention,” “disobedient,” “immaturity,” 
and “aggressive bullying,” may be interpreted as an Immature Con- 
duct Problem factor. Factor II had highest loadings on “truancy 
from school,” “school phobia,” and “suicidal attempts,” and may be 
interpreted as a Phobic-Suicidal factor. Factor III, with its highest
loadings on “destructiveness” and “stealing,” is clearly a second 
variant of the Conduct Problem factor, and will be labelled De- 
linquent Conduct Problem. 


Comparison of Factor Structure of Three Checklists 


Table 5 presents the congruence matrix obtained when congruence 
coefficients were computed from the factor scores of the 102 experi- 
mental subjects on the four PPC factors, the four WGCC factors, 
and the three IJRC factors. The coefficients representing relation- 
ships between factors within tests indicate that even though oblique 
rotations were performed, the factors within tests are virtually un- 
correlated. (The correlation between WGCC Conduct Problem and Conflict with Par-
ents is the major exception.) It would appear that the natural
simple structure of the checklists is generally orthogonal. The only
moderately high coefficient representing congruence across check-
lists is the .79 obtained between PPC Conduct Problem and WGCC
Conduct Problem.

The canonical variate analysis of two checklists at a time Te 
vealed that there were linear combinations of factors within check- 
lists which would yield higher congruence coefficients across check- 
lists than the original factors did. The canonical variate pattern for 
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TABLE 4 
Rotated Factor Loadings for IJ RC Items 


IJRC Factors 
کے‎ 
Variables I п ш 
L Immature Conduct Problem 
18. Excessive demands for attention 04 ‚07 -.10 
19. Disobedient 58 18 .07 
23. Immaturity 48 -.22 -.03 
15. Poor peer or sibling relationships 
(aggressive, bullying) .45 -.11 .37 
29. General unhappiness .40 .00 -n 
17. Defiant and rebellious 36 .18 .35 
П. Phobic-Suicidal 
2. Truancy from school = -75 -16 
14. School phobia ll E! —.05 
10. Suicidal attempts ll ‚62 -.12 
9. Suicidal threats е 25 4 -3 
3. Absence from home without parents 
knowledge or permission 23 -35 5 
ПІ. Delinquent Conduct 
1. Stealing -.01 .01 52 
7. Destructiveness (other than 
firesetting) Й .08 -.09 5 
16. Poor peer or sibling relationships 12 2m 
(passive, victimized) 1 nic" SA 
28. Shy, withdrawn, timid =,00 ; =» s 
the PPC and the WGCO is presented in Table 6. The first canonical 


variate for the PPC is almost totally defined by the original Con- 


duet Problem factor. However, the original Organic-Somatic factor 
te, probably because the 


also loads upon this first canonical varia em 
Organie-Somatie factor includes items such as “restless = 


“tension,” which would be consistent with the acting out focus of 
ate for the 


the Conduct Problem factor. The first canonical vari 
WGCC is likewise almost totally defined by the original Conduct 
Problem factor. However, the original Conflict with Parents and 
School Failure factors also have moderate loadings on this canoni- 
cal variate. The canonical correlation between factor scores on the 
new combined factors (canonical varjates) is 82 (p < 01). This 
correlation represents no appreciable improvement upon the r of 


79 obtained between the original Conduct Problem factors of the т 
checklists (see Table 5) since the first canonical variate for 
checklist in essence coincides with the original Conduct Problem 
TABLE 6
Canonical Variate Pattern for Comparison of PPC and WGCC

                                     Loadings on        Loadings on
                                     First Canonical    Second Canonical
Original Factor                      Variate            Variate

PPC
  I    Conduct Problem                .97                -.06
  II   Personality Problem-Autism     [illegible]        [illegible]
  III  Personality Problem           -.04                [illegible]
  IV   Organic-Somatic                .35                [illegible]

WGCC
  I    School Failure                 .35                [illegible]
  II   Conduct Problem                [illegible]        [illegible]
  III  Unhappy-Unsocial               [illegible]        [illegible]
  IV   Conflict with Parents          [illegible]        [illegible]

Note.—The loadings of each original factor on a canonical variate are the simple correlations
between the factor and the canonical variate.


factor. The second canonical variate of the PPC combines the two
original factors which contained items involving intra-psychic dis-
tress (Personality Problem-Autism and Personality Problem), and can be
interpreted as a Personality Problem factor. The second canonical
variate of the WGCC is also mainly a combination of the two
original factors involving inner discomfort. The canonical correla-
tion between the canonical variates is .60 (p < .01). When the
original factors were correlated, the highest correlation involving
Personality Problem items was .45 between PPC Personality Prob-
lem-Autism versus WGCC School Failure.

The canonical variate analysis of the PPC and the IJRC yielded
only one significant canonical correlation of .59 (p < .01), which
did, however, exceed the highest coefficient of congruence (.48) be-
tween PPC and IJRC original factors (see Table 1). The first
canonical variate of the PPC was again defined almost entirely by
the original Conduct Problem factor. The loadings of the original
PPC factors on the first canonical variate were: .97 (Conduct
Problem), -.19 (Personality Problem-Autism), -.10 (Personality
Problem), and .22 (Organic-Somatic). The loadings of the original
IJRC factors on the first canonical variate were .48 (Immature
Conduct Problem), .03 (Phobic-Suicidal), and .83 (Delinquent
Conduct).

The canonical variate analysis of the WGCC and the IJRC
also yielded only one significant canonical correlation of .62 (p <
.01). Again, the factor common to the two checklists was Conduct
Problem. The loadings of the original WGCC factors on the first
canonical variate were -.06 (School Failure), .88 (Conduct Prob-
lem), .04 (Unhappiness-Unsociability), and .61 (Conflict with
Parents). The loadings of the original IJRC factors on the first can-
onical variate were .23 (Immature Conduct Problem), .25 (Phobic-
Suicidal), and .95 (Delinquent Conduct).
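
The computation behind the canonical correlations and loadings reported above can be illustrated with a minimal sketch in Python with NumPy (a tool not used in the original analysis). It assumes two arrays of factor scores for the same subjects, one array per checklist, each of full column rank. Canonical correlations are taken as the singular values of the product of orthonormal bases for the two centered score sets, and loadings are computed as simple correlations between each original factor and each canonical variate, as described in the note to Table 6. The function name and data layout are hypothetical.

```python
import numpy as np

def canonical_analysis(X, Y):
    """Canonical correlations and structure loadings for two score sets.

    X, Y: (n_subjects, n_factors) arrays of factor scores (assumed layout).
    Returns the canonical correlations and, for each set, the correlations
    of every original factor with every canonical variate of that set.
    """
    # Center the scores so inner products are proportional to covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the two column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    variates_x = Qx @ U      # unit-length, mean-zero canonical variates
    variates_y = Qy @ Vt.T
    # Loadings: simple correlations between factors and canonical variates.
    load_x = (Xc.T @ variates_x) / np.linalg.norm(Xc, axis=0)[:, None]
    load_y = (Yc.T @ variates_y) / np.linalg.norm(Yc, axis=0)[:, None]
    return np.clip(s, 0.0, 1.0), load_x, load_y
```

With a single factor in each set, the first canonical correlation reduces to the absolute value of the ordinary product-moment correlation, which is a convenient check on the sketch.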
Discussion 

The first purpose of the present study was to investigate factor 
comparability across item samples with subject sample held con- 
stant. When the results of the separate factor analyses and the 
canonical variate analyses are considered together, the findings sug- 
gest only moderate generality of factors across checklists. 

Previous research established a high degree of replicability for at 
least one basic dimension of childhood psychopathology, namely, 
the internalizing versus externalizing or neurotic versus psychopathic 
distinction (Peterson, 1961; Collins, Maxwell, and Cameron, 1962; 
Jenkins, 1964; Quay, 1964; Achenbach, 1966). The present findings
are consistent with this earlier work. The canonical variate analy-
sis revealed that a Conduct Problem factor was common to all
three checklists, while a Personality Problem factor was common to 
the PPC and the WGCC. 

However, systematic variations in item content across the three
checklists resulted in three descriptions of the basic psychopath-
ology of the sample which varied more than many descriptions of
different samples evaluated on a single set of items (e.g., see Quay,
1966). The IJRC, for example, did not even yield the classic Person-
ality Problem factor, probably because it contains only two relevant
items, “shy, withdrawn, timid,” and “general unhappiness.” It is
also possible that the psychiatrists who rated the IJRC on the basis
of [illegible] reports were more likely to note the most dramatic
[symptoms, in] contrast to the mothers who themselves filled out the
[illegible]. Thus the only IJRC factor reflecting [inner distress]
[had high] loadings for the items “truancy from school,”
“school phobia,” and “suicidal attempts.”

The WGCC produced a School Failure factor which did not
emerge from either of the other two checklists. The WGCC contains
seven items relating to school behavior and academic achievement,
while the PPC contains three, and the IJRC only two. The WGCC
School Failure factor is similar to Collins, Maxwell, and Cameron's 
(1962), Timid, School Failure factor, Dreger et al.'s (1964) Intel- 
lectual and Scholastic Retardation factor, and Miller's (1967b) 
Learning Disability factor, all of which were obtained from child
guidance clinic samples. During the nine-year period of 1951-1960,
9.5 per cent of 6483 children referred to the Institute for Juvenile
Research for child guidance services had a learning problem as the 
primary problem area (Lessing and Schilling, 1966, p. 325). Pri- 
mary problem area was not psychiatrically rated for the present 
sample consisting of patients from the same clinic. However, it is 
evident that school failure emerges as an important syndrome when 
there is sufficient item density (as on the WGCC) to permit such a 
clustering. 

The PPC likewise yields an important clinical syndrome that 
does not emerge from either of the other two checklists. PPC Factor 
II, which was labelled Personality Problem-Autism, contains a
cluster of items such as “sluggishness, lethargy,” “preoccupation,” 
“anxiety, chronic fearfulness,” and “easily flustered and confused,” 
which delineate a regressive behavior pattern not obtainable from 
the WGCC or the IJRC. 

Some of the checklist-specific factors are of questionable status 
and would require further replication before being considered as 
major, stable psychopathological syndromes. For example, PPC Fac- 
tor IV Organic-Somatic with its combination of items reflecting 
tension and irritability with items describing physical isque 
may be a variant of either the Hyperactive, Brain-Injured syn- 
drome of Jenkins and Glickman (1946), or the Somatic Com- 
plaints factor of Achenbach (1966, p. 18). WGCC Factor IV, Con- 
flict with Parents, is probably a variant of Factor II, Conduct 
Problem. In fact, the correlation of .35 (p < .01) between WGCC
factors II and IV was the one exception to the general orthogonality 
of the intra-checklist factors (see Table 5). 

The second purpose of the study was to provide additional infor-
mation regarding the major factors obtainable from the PPC, with
particular attention being given to the item content of the Inade-
quate-Immature or Pre-Psychotic factor. The Conduct Problem
factor is extremely stable even when it is derived from varying
subsets of the PPC items. Previous investigators eliminated in-
frequently endorsed items, with 10 per cent generally being the
minimum acceptable percentage of endorsement (Quay and Quay,
1965; Quay, 1966). Thus the number of items actually subjected 
to factor analysis varied from 26 for the seventh-grade sample 
studied by Quay and Quay (1965) to 55 for the fifth- and sixth- 
graders studied by Peterson (1961). All 58 items were used in the 
factor analyses conducted for the present study. Yet the Tucker 
coefficients reported in Table 2 all exceed .80 when the Conduct 
Problem factor is compared across studies.
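
For readers unfamiliar with the index, the Tucker coefficient of congruence between two factors is the sum of the products of their loadings divided by the square root of the product of their sums of squared loadings. The following minimal sketch assumes that the two loading vectors cover the same items in the same order; the function name is hypothetical and no loadings from the studies cited are reproduced.

```python
import math

def congruence(a, b):
    """Tucker's coefficient of congruence between two loading vectors.

    a, b: loadings of the same items, in the same order, on two factors.
    """
    if len(a) != len(b):
        raise ValueError("loading vectors must cover the same items")
    cross = sum(x * y for x, y in zip(a, b))
    scale = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return cross / scale
```

Unlike a product-moment correlation, the coefficient does not subtract the mean loading, so two factors whose loadings are proportional yield a value of 1 regardless of their overall level.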

The clustering of items representing internalized symptoms shows 
much less stability. Historically, the Personality Problem factor
was identified first (Peterson, 1961). The next study utilizing the 
PPC (Peterson et al., 1961) produced two factors upon which 
the internalized items loaded heavily. Thus “sluggishness, lethargy” 
and “preoccupation,” which had been among the 15 symptoms 
interpreted as defining the Personality Problem factor in the original 
study, now loaded upon the new third factor, labelled “Autism.” In 
subsequent studies, the third factor, which was re-labelled “Inade- 
quacy-Immaturity,” proved to be rather unstable in item con- 
tent (Quay and Quay, 1965; Quay, 1966). In fact, Quay and Quay 
(1965, p. 218) reported a Tucker coefficient of .44 for the congruence 
between the Inadequacy-Immaturity factor obtained in their eighth- 
grade sample and the same factor obtained from Quay's adolescent 
delinquent sample. However, a Tucker coefficient of .67 was reported 
for the congruence between the eighth-grade Inadequacy-Immatur- 
ity factor and the seventh-grade Personality Problem factor. In 
the present study, Factor II was labelled Personality Problem-
Autism in order to emphasize the mixture of anxious, worrying,
self-depreciating neurotic characteristics with the withdrawn, un-
responsive, regressive items. Factor II was found to be almost
equally congruent with the Personality Problem and Inadequacy- 
Immaturity factors obtained in previous studies (see Table 2). 
Both the present and previous findings in regard to the internal-
izing items on the PPC may be a reflection in the area of childhood
psychopathology of a phenomenon noted by Eysenck (1955). When
four objective tests were administered to 20 normals, 20 neurotics,
and 20 psychotics ranging in age from 20-40, two orthogonal canoni-
cal variates were obtained and labelled “neuroticism” and “psychoti-
cism.” It was found that high psychoticism scores were nearly
always combined with high neuroticism scores, though high neuroti-
cism-low psychoticism was the characteristic pattern for the neurot- 
ics in the sample. The validity of this interpretation can be evalu- 
ated only by administering the PPC to a sample containing 
comparable proportions of neurotic and psychotic or pre-psychotic 
children. 

The findings of the present study highlight the need for data 
covering the full range of psychopathology from an adequate sam- 
ple of clinical types. The collection of such data must be preceded by 
careful consideration of the problems involved in defining the uni- 
verse of symptom items and sampling it adequately. Loevinger 
(1965) has described the difficulties inherent in conceiving of items, 
rather than persons, as a population to be sampled. The majority 
of investigators concerned with basic symptom factors have ap-
peared to be guided by the implicit view that the universe to be 
sampled consists of all presenting symptoms reported by parents 
or other observers of disturbed children. However, there has been 
insufficient attention paid to the consequences of sampling from 
this hypothetical universe in various ways.

The three methods used to construct the symptom checklists used 
in the present study (selecting the most frequently reported symp- 
toms of representative clinic cases, selecting the symptoms pro- 
viding statistically significant discrimination between normal and 
clinic cases from a pool of items provided by 50 mothers of clinic 
cases, and selecting subjectively judged important symptoms) pro- 
duced three quite different item samples. When diagnostic symptom 
checklists are being constructed, sampling methods should be chosen 
on the basis of their ability to produce an item sample that is opti- 
mal for the major purposes of diagnostic classification: parsimoni- 
ous description, investigation of etiology, determination of progno-
sis, selection of treatment method, and evaluation of treatment
outcome. All of these purposes require factor generality across sub-
ject and item samples to an extent that will permit the accumula- 
tion of comparable data and the applied use of research findings in a 
variety of clinical settings. Therefore, it is necessary to do more than 
merely avoid obviously deficient sampling methods, such as the 
haphazard procedure used to compose the IJRC, and questionable 
methods such as the use of only 50 mothers to generate symptom 
items for the WGCC. Even the results of a single systematic 
sampling of items such as that used in the construction of the PPC
may require supplementation with items defining major factors 
obtained from other samples. 

The accumulation of independent studies, each based on differ- 
ent item samples, derived from different subject samples, can be 
regarded as the equivalent of many samplings from the universe of 
symptoms of disturbed children. Major factors which are consistently 
obtained across several independently and systematically obtained
sets of symptom items may be considered to constitute the ideal 
factors for organizing the available pool of symptom data. Any spe- 
cific item sample or checklist whose item content will not yield one 
of these replicated factors can be considered to be unrepresentative 
of the hypothetical universe of symptoms, and thus in need of
modification. The construction of new symptom checklists should
ideally involve the use of marker variables from previous studies as 
well as symptom items obtained from the subject sample under 
immediate consideration. Miller (1967a) provided a noteworthy 
example of the use of this principle in constructing the Louisville 
Behavior Check List for Males 6-12 Years of Age. 

In accordance with the line of reasoning just presented, it would 
be advisable to add more school performance items to the PPC and 
more autism and inner distress items to the WGCC. It would then be 
possible for each of these checklists to yield the Personality Prob- 
lem, Conduct Problem, Learning Disability, and Autism factors 
which have the greatest generality across studies (Peterson, 1961; 
Collins, Maxwell, and Cameron, 1962; Quay and Quay, 1965; Achen- 
bach, 1966; Miller, 1967a). The IJRC does not provide sufficiently 
balanced coverage of symptoms to serve as a diagnostic instrument. 


Summary 


The purposes of the study were: (a) to compare the symptom 
factors obtained when three different sets of items were utilized for 
behavior ratings of the same sample of children, and (b) to provide 
additional information regarding the major factors obtainable from 
the widely used Peterson Problem Checklist. Subjects were 102 
child guidance patients, aged 10 years 0 months through 12 years 
11 months. All were rated on the Peterson Problem Checklist (PPC) 
and the Wichita Guidance Center Checklist (WGCC) by their 
mothers, and on the Institute for Juvenile Research Checklist (IJRC) 
coded by the examining psychiatrist on the basis of the mothers’ 
oral report of their children's symptoms. The PPC yielded four
factors: Conduct Problem, Combined Personality Problem-Autism, 
Personality Problem, and Organic-Somatic Problem. The WGCC 
yielded four factors: School Failure, Conduct Problem, Unhappiness- 
Unsociability, and Conflict with Parents. The IJRC yielded three 
factors: Conduct Problem, Phobic-Suicidal Syndrome, and Delin- 
quent Conduct Problem. Even when the techniques of canonical 
variate analysis were used to combine factors within checklists in 
order to maximize congruence across tests, only moderate generality 
of factors across tests was obtained. A reconstituted Conduct Prob-
lem factor was common to all three checklists and a reconstituted 
Personality Problem factor was common to the PPC and the WGCC. 

The results of the study were considered to highlight the need 
for careful attention to the problem of symptom item sampling in 
the construction and revision of symptom checklists. It was sug- 
gested that symptom checklists with the degree of factor structure 
comparability required for diagnostic purposes can most readily be 
constructed by supplementing the results of any single sampling of 
the universe of symptom items with marker variables defining ma- 
jor factors obtained in other studies. 
INDIVIDUAL DIFFERENCES IN DIAGNOSTIC
JUDGMENTS OF PSYCHOSIS AND NEUROSIS
FROM THE MMPI¹


NANCY WIGGINS 
University of Illinois 


The numerous comparisons of clinicians with computers in fore-
casting behavior (usually to the computer’s advantage) have led
investigators to the more fundamental problem of how different
clinicians arrive at their predictions; for a recent review see
Goldberg (1968). This concern with the clinical judgment process 
has necessarily involved the notion of individual differences among 
clinical judges. As with most kinds of data, individual differences 
in clinical judgments can be treated in two ways. Should differences 
among judges exist, these differences can be treated as error, and 
an “average” judge would be said to provide the most meaningful 
summary for all of the judges. This approach provides general- 
izability to other clinicians similar to the ones being studied, at 
the expense of ignoring individual differences. Thus, it is assumed
that the mean or average judgment is representative of all of the 
judges in the sample, or, alternatively, that the judges are repli- 
cations of one another within error of measurement. 

However, individual differences become important in the context 
of studying the clinician’s judgmental processes. It was in this 
spirit that Hoffman (1960) compared the judgmental models of 
two subjects asked to judge the intelligence of persons represented 
by a series of nine-cue profiles. Hoffman provided a “paramorphic”
model of each judge by obtaining the regression weights from the
multiple regression of the nine input cues on the judgments them-
selves. These weights provided a model indicating the relative 
emphasis a single judge placed on each cue in this context. The 
major problem with enumerating and comparing the results of 
clinical judges, individually, is that of generalizability. Thus, one 
would be willing to generalize the results, or model, for a single 
judge only to that same judge for future occurrences of a similar 
task. 
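
A paramorphic model of the kind just described can be sketched as an ordinary least-squares regression of one judge's ratings on the cues. The Python (NumPy) fragment below is illustrative only and is not Hoffman's procedure in detail: the function name, the array layout, and the choice to report standardized weights are assumptions.

```python
import numpy as np

def paramorphic_weights(cues, judgments):
    """Standardized regression weights and multiple R for a single judge.

    cues: (n_profiles, n_cues) array; judgments: (n_profiles,) ratings.
    """
    # Standardize cues and judgments so the weights are comparable across cues.
    Z = (cues - cues.mean(axis=0)) / cues.std(axis=0)
    y = (judgments - judgments.mean()) / judgments.std()
    # Least-squares weights; no intercept is needed after centering.
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Multiple correlation between the model's predictions and the judgments.
    multiple_r = float(np.corrcoef(Z @ beta, y)[0, 1])
    return beta, multiple_r
```

The relative sizes of the weights indicate the emphasis the judge places on each cue, while the multiple correlation indicates how well a linear rule reproduces that judge's ratings.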

A slightly different approach to individual differences in human 
judgment is found in the theoretical models of Tucker and his 
associates (e.g., Tucker and Messick, 1963; Tucker, 1960). By 
specifying conceptual types of judges, this approach allows for 
individual differences to emerge, while providing considerably more 
parsimony than is the case when each judge is treated individually. 
The results, or model, for different types of judges can be general-
ized to similar samples of judges; it would not be expected that an 
"identical twin" of any single judge would emerge in a new sample 
of judges but only that a similar type of judge would be general- 
izable. 

Specifically, Tucker’s approach involves the factor analysis of 
sums of squares and cross-products among judges across judgments. 
This is directly analogous to an obverse factor analysis of subjects 
rather than variables. Subject factors are then positioned in such 
a way as to represent meaningful “conceptual” or “idealized indi-
viduals” (Cliff, 1968). Thus, a subject factor (or “theoretical 
subject”) is passed through real subjects in such a way as to 
represent a subgroup of response-homogeneous judges. Should the
number of subject factors be one, it would be difficult to argue 
the case for individual differences in judgment. On the other hand, 
should the number of subject factors be greater than one, it would 
be important to isolate meaningful “idealized” types and to deter- 
mine the personality correlates of these idealized individuals. 
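
The first step of this approach, factoring the sums of squares and cross-products among judges, can be sketched as follows. This is a minimal illustration in Python with NumPy under an assumed data layout (profiles in rows, judges in columns); it is not the program used for any of the studies cited, and the function name is hypothetical.

```python
import numpy as np

def subject_components(ratings, n_keep):
    """Principal components of the judges-by-judges cross-product matrix.

    ratings: (n_profiles, n_judges) array of raw judgments.
    Returns all eigenvalues (descending), the loadings of the judges on the
    first n_keep components, and the share of the total sum of squares
    accounted for by those components.
    """
    R = np.asarray(ratings, dtype=float)
    S = R.T @ R                          # sums of squares and cross-products
    evals, evecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    kept = np.clip(evals[:n_keep], 0.0, None)
    loadings = evecs[:, :n_keep] * np.sqrt(kept)
    share = kept.sum() / evals.sum()
    return evals, loadings, share
```

Because raw cross-products rather than correlations are factored, the first component largely reflects the means of the judgments, so a very large first eigenvalue is not by itself evidence against individual differences.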

Although a variety of studies using Tucker’s model have attested 
to the importance of individual differences in judgmental view- 
points (Messick and Kogan, 1966; Pederson, 1962; Skager, 
Schultz, and Klein, 1966; Walters and Jackson, 1966; Wiggins, 
1966; Wiggins, Hoffman, and Taber, 1969; Wiggins and Fish- 
bein, 1969; Wiggins and Wiggins, 1969; Snyder and Wiggins,
1970), this model has not been widely applied to the area of
clinical judgment. Using Tucker’s idealized type approach, the pres-
ent study involved an analysis of clinical judgment data orig- 
inally collected by Meehl (1959). In particular, 29 clinicians were 
required to make diagnoses of psychosis vs. neurosis on an 11-step, 
forced-normal distribution for 861 MMPI profiles from seven hos- 
pitals and clinics around the country. One study utilizing these 
data (Horn and Stewart, 1968) approached the problem of possible 
individual differences among these 29 judges in a manner similar 
to that taken by the present study. Horn and Stewart factor 
analyzed the diagnostic judgments of the 29 clinical judges, as
well as the actual criterion (hospital diagnosis), across the 861
MMPI profiles and retained three subject factors. A varimax rota-
tion of these three factors indicated that the first factor was also
marked by the criterion, suggesting a validity component of judg-
ment. No interpretation of the remaining two factors was made.

This inability to interpret the second and third subject factors 
is not surprising in view of the fact that few personality or judg- 
mental measures of the clinicians were utilized in the analysis. 
Except for the variable “amount of clinical training” (which was 
unrelated to any of the factors), Horn and Stewart had no vari- 
ables by which to identify their subject factors. This is particularly 
unfortunate in light of the considerable body of research on these 
same 29 clinicians (Goldberg, 1965, 1968, 1969, 1970; Wiggins 
and Hoffman, 1968). 

In addition to the lack of personality and judgmental correlates 
for Horn and Stewart’s three subject factors, another aspect of their
analysis should be noted. Horn and Stewart performed a varimax
rotation of their three subject factors. There is no guarantee that
these varimax factors represented meaningful “idealized individ-
uals,” i.e., passed through real subjects. A plot of Horn and
Stewart's varimax factors revealed that these factors were rather
poor representations of real subjects; the factors did not pass
through, or near, any of the 29 judges. Thus, the three
factors did not represent real clinicians, and with the exception of
the first factor, the remaining two factors were uninterpretable.

In the present study these data were re-analyzed in light of
the foregoing considerations. It was hypothesized that individual
differences in judgmental viewpoints would emerge. It was further
predicted that such differences would be manifested in significant

202 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


personality and judgmental correlates of the idealized subject types, 
when these idealized individuals represented real judges in the 
subjects’ factor space. Psychometrically, these hypotheses can be 
restated: (a) More than one subject factor will be necessary to 
account for the sums of squares and cross-products among clini- 
cians; (b) these factors can be rotated in such a way as to represent 
meaningful idealized individuals, i.e., subgroups of response-homo- 
geneous judges; and (c) these different idealized individuals will 
differ on available judgmental and personological measures. At 
the least, it would be expected that if different idealized individuals 
emerge, their paramorphic judgmental models should distinguish
among them. 


Method 


Subjects. The data utilized in the present study were originally
collected by Meehl (1959) and have been described extensively 
elsewhere (Goldberg, 1965, 1968, 1969, 1970). Thirteen of the 
subjects were Ph.D. clinical psychologists (staff) and the re- 
maining 16 subjects were predoctoral trainees at the University 
of Minnesota. These 29 judges were given seven samples of MMPI 
profiles, one sample at a time, and were asked to sort each group 
of profiles on an eleven-step forced-normal distribution rang-
ing from most (likely) neurotic to most (likely) psychotic. Each
MMPI profile consisted of eight clinical scales (excluding Mf) and
three validity scales. The only information given the clinicians
was that the samples represented males under psychiatric care who 
were diagnosed as psychotic or neurotic. In fact, the percentage 
of psychotics in each sample ranged from 37% to 64%, with a
median of 51% over all 861 profiles. 

Judgmental and personological variables. From the extensive 
research on these 29 clinicians, a variety of judgmental and 
personological variables were available. The judgmental variables,²
taken primarily from Goldberg (1970), were obtained separately
for each of the 29 clinical judges, based on analyses using the
total group of 861 MMPI profiles. Goldberg (1970) described
these variables as follows:

a. Validity coefficient of the judge: The correlation between
the judge’s predictions and the actual criterion values.


2 The data for the present study were obtained from Lewis R. Goldberg.
b. Linear predictability of the judge: The multiple correlation
between the eleven MMPI scale scores and the judge's predic-
tions.

c. Reliability of the judge: Correlations between a judge’s re-
sponses to 100 pairs of empirically matched profiles.

d. Validity of the judge's linear model: The correlation between 
the predicted judgments based on the judge's linear regression 
model and the actual criterion values. 

e. Linear component of judgmental accuracy: The correlation 
between the predicted values from the judge's linear regression 
model and the predicted values for the linear model relating 
MMPI scale scores to the criterion. This variable is perfectly 
correlated with the validity of the judge’s model (d) and provides 
an alternative interpretation to that measure. 

f. Nonlinear component of judgmental accuracy: The correlation 
between residual values of the criterion and the residuals of the 
judge’s predictions after the linear components are removed. 

g. Incremental validity of model over judge: The arithmetic 
difference between the validity of the judge's model and the actual 
validity of the judge (d - a).

h. Relationship to composite judge: The correlation between 
the clinician's judgments and those of the “composite” judge (the 
average of all of the 29 judgments for each profile). 
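
Several of the variables defined above (a, b, d, e, f, and g) can be computed for a single judge from three arrays: the cue scores, the judge's predictions, and the criterion. The sketch below, in Python with NumPy, is offered as an illustration under assumed names and data layout and is not Goldberg's program. Reliability (c) and the relationship to the composite judge (h) require matched profile pairs and the full set of judges, respectively, and are omitted.

```python
import numpy as np

def _fitted(cues, target):
    """Least-squares predictions of target from the cues (with intercept)."""
    X = np.column_stack([np.ones(len(cues)), cues])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef

def _r(u, v):
    """Product-moment correlation between two vectors."""
    return float(np.corrcoef(u, v)[0, 1])

def lens_model_indices(cues, judgments, criterion):
    """Judgmental variables a, b, d, e, f, g for a single judge."""
    j_hat = _fitted(cues, judgments)   # the judge's linear model
    c_hat = _fitted(cues, criterion)   # linear model of the criterion
    validity = _r(judgments, criterion)            # (a)
    model_validity = _r(j_hat, criterion)          # (d)
    return {
        "validity": validity,
        "linear_predictability": _r(judgments, j_hat),             # (b)
        "criterion_predictability": _r(criterion, c_hat),
        "model_validity": model_validity,
        "linear_component": _r(j_hat, c_hat),                      # (e)
        "nonlinear_component": _r(judgments - j_hat,
                                  criterion - c_hat),              # (f)
        "incremental_validity": model_validity - validity,         # (g)
    }
```

Under these definitions the validity of the model (d) equals the linear component (e) multiplied by the linear predictability of the criterion, a quantity that is the same for every judge; this is why (d) and (e) are perfectly correlated across judges.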

Most of these judgmental variables stem from Tucker's (1964) 
and Hammond, Hursch, and Todd's (1964) formulation of clinical 
judgment in terms of the Brunswik lens model. Further, these
variables have been shown to be of considerable importance in 
Goldberg’s (1970) recent work comparing the validity of the 
judge with the validity of his model.

In addition to the above variables, a few personological and 
demographic variables were obtained: sex, staff vs. trainee, and 


ratings of “likeability” and “relative intelligence” by a psychol-
ogist who knew all but two of the 29 judges. Moreover, for each
judge the standardized regression weights for each of the eleven
MMPI scales were obtained by regressing the 11 MMPI scale
scores onto the 861 judgments. These weights constituted eleven
separate variables for each judge. 


3 Goldberg, L. R. Personal communication.
Method of analysis. First, sums of squares and cross-products⁴
were obtained among the 29 judges across the 861 MMPI profiles. 
This matrix of cross-products was subjected to a principal com- 
ponents analysis; the number of factors retained in this analysis 
was based on an examination of the successive distribution of 
eigenvalues, as well as on Tucker's (1966) mean square ratio test 
for factor significance. Each of the retained unrotated subject 
factors was correlated with all of the outside judgmental variables 
in order to identify the unrotated principal components. Next, these 
principal components were hand rotated to represent meaningful
idealized individuals (Cliff, 1968) by passing each vector through 
a single judge. The projections of each of the 29 judges on the 
idealized individuals (vectors) were obtained, a procedure which 
is directly analogous to factor rotation in the space of individuals. 
Again, the correlation of each idealized individual with all of the 
judgmental and personological variables was obtained in order to 
identify the different judgmental points of view. 
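
The rotation step can be illustrated with a short sketch in Python with NumPy. It represents one plausible reading of the procedure, not the hand rotation actually carried out: each idealized individual is taken to be the vector through one marker judge in the component space, and every judge is projected onto that vector, scaled so that the marker judge's own projection equals unity. The function name and the loading values in the usage lines are hypothetical.

```python
import numpy as np

def project_on_markers(loadings, marker_rows):
    """Project all judges on vectors passed through selected marker judges.

    loadings: (n_judges, n_components) loadings on the retained components.
    marker_rows: row indices of the judges chosen as markers (assumed input).
    Returns an (n_judges, n_markers) array in which each marker judge has a
    projection of 1.0 on its own idealized individual.
    """
    A = np.asarray(loadings, dtype=float)
    M = A[list(marker_rows)]            # coordinates of the marker judges
    scale = np.sum(M * M, axis=1)       # squared length of each marker vector
    # Inner product of every judge with every marker, rescaled column-wise.
    return (A @ M.T) / scale

# Hypothetical loadings for four judges on three components.
A = np.array([[0.9, 0.3, 0.1],
              [0.8, -0.4, 0.0],
              [0.7, 0.1, 0.6],
              [1.8, 0.6, 0.2]])
P = project_on_markers(A, [0, 1, 2])
```

Because the marker judges need not be orthogonal to one another in the component space, projections obtained in this way will generally be correlated across idealized individuals.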


Results 


Three subject factors were extracted from the matrix of inter- 
subject cross-products. Two criteria were invoked to determine 
the “significance” of these three subject factors: (a) The distri- 
bution of successive eigenvalues indicated a large drop in variance 
after the third factor, with the remaining successive differences 
approaching an arbitrarily small constant. The eigenvalues were: 
903055, 7020, 3409, 2288, 2191, ..., 484. The first three eigenvalues
accounted for 97 per cent of the sums of squares; (b) Tucker's 
(1966) mean square ratio test, an approximate F-test for factor 
significance, indicated significant F-ratios (p < .05) for the first
three factors. Although the first eigenvalue is large relative to the 
remaining ones, this is due to the recovery of the means by the 


4 [Illegible] argued that when individuals are factored across a
[illegible] scale, sums of squares and cross-products among subjects
should be utilized instead of intercorrelations. Thus any differential mean
effect would be uncovered by the first principal component, thereby maxi-
mizing the possibility of individual differences. However, for the present data,
[illegible] and as such the means were
[illegible]. Although the present analysis utilized cross-products
following Tucker and Messick’s individual differences model (1963), it is
noted that the results would be quite similar to an analysis of intercorrelations.
first principal component when cross-products, rather than corre-
lations, are factored. 

Table 1 presents the significant (p < .05) correlations between 
each of the judgmental and personological variables and the 
three unrotated principal components. The first unrotated principal 
component was highly correlated with the judges’ relationship to 
the composite judge. This is not surprising since the first compo- 
nent was a general factor which essentially recovers the original 
means in the analysis. These means, of course, would be related 
to the composite judge. In addition, the first principal component
[remainder of sentence illegible].
Component II represented a validity factor, as seen by the high
correlation between this component and the judges’ validities (r =
.87). An even higher correlation of this principal component was
found with the validity of the judges’ models (r = .94), indicating
that this factor represents essentially valid judges whose models
tend to outperform them in predicting the actual criterion. This
result can be found in Goldberg’s (1970) comparison of judges
and their models. Thus, the more valid the judge, as represented
by principal component II, the more valid his model. In addition,
the judges at one pole of component II tended to be ‘liked’ and to
be rated as relatively more ‘intelligent’ than those at the other
pole. Unrotated component III represented both a lack of linear
predictability as well as a lack of reliability. Neither sex nor


TABLE 1
Significant Correlations between Three Unrotated Principal
Components and Judgmental Variables

                                  Principal Components
Judgmental Variables               I      II     III
l. Validity 79 31 —.02 
2. Linear predictability `6 = .57 
3. Reliability s .94 
4. Validity of model 4و‎ 
5. Linear component of accuracy 
6. Incremental validity of model .52 

over judge 
T. Correlation of judge with 80 46 

ccm osite judge У 46 
P “Likeability” .39 


- “Intelligence” 
Note.—Only correlations > .36 (p < .05) have been tabled.


Since the first principal component tended to be a general factor
with little variability among subject loadings, a plot of the second
and third unrotated principal components was examined. This plot
is presented in Figure 1. As noted earlier, component II, a bipolar
[factor], discriminated the most valid judges from the least valid
[judges], and factor three pulled out a single idiosyncratic
[judge. A triangle was] drawn in Figure 1 which connected three judges
[...] and third principal components. The triangle
illustrates the configuration of subjects with respect
[to the second] and third unrotated principal components.
[With] reference to Figure 1, the three unrotated principal components
were rotated in such a way that each factor [passed] directly
[through] each of the three marker judges (end points of the
[triangle]). [The] projections of all of the remaining judges on these
Figure 1. Plot of 29 clinicians on principal components II and III.
rotated factors were obtained. Each rotated factor thus ...
represented a meaningful "idealized individual"; that is, the
factors were relatively unipolar with respect to all of the ...,
they were marked by three real judges, and they ...
all of the subjects in the subject factor space. In addition, the
factor rotation was performed in such a way that each of the
marker judges had a loading of unity on the corresponding


Note. Only correlations > .36 (p < .05) have been tabled.


... three idealized individuals and each of the ...
personological variables. Idealized individual I, the ...
the more valid judges, represented the validity ...
judgment (r = .98). This idealized individual ...
his judgments with the ... (r = .97). Clinicians represented by
this idealized ... to be liked and to be rated as relatively ...
the other judges. The significant correlation with ...
validity of the judge's model indicated that the models ...
judges were more valid than their own judgments.


* It is noted that since the three marker judges were not orthogonal ...
in the subject factor space, the ... on the ... correlated. The
intercorrelations between the ... r = -.26.
Idealized individual II represented judges who were not valid
but whose judgments were linearly predictable from a multi- 
ple regression equation relating MMPI scales to judgments. The 
models for these judges were considerably worse in predicting the 
criterion than their own judgments, both model and man being 
relatively invalid. This group of judges tended to be disliked. 
Idealized individual III, marked primarily by a single, idiosyn- 
cratic judge, represented an unreliable judge whose judgments were 
not linearly predictable from the MMPI scales. This idealized 
individual was also negatively related to the composite judge. 
Although idealized individual III did not correlate significantly 
with validity (r = .34), the single judge marking idealized indi-
vidual III was the second most invalid judge with a validity
coefficient of .15. 

A comparison of these three idealized types of judges indicated 
that for the valid idealized judge, his model was more valid than 
his judgments. The reverse was true for the invalid but linearly 
predictable idealized judge: his model was worse than his judg- 
ments. Neither of these two idealized judges was correlated signif- 
icantly with reliability. In comparison, for the unreliable and less 
linearly predictable idealized judge, the validity of his model fared 
about as well as his own judgments. These results support Gold- 
berg’s (1970) theoretical comparison between man and his model: 
when man has any positive validity the model will outperform 
the man. As man becomes more linearly predictable, the validity 
of his model approaches the validity of his own judgments. 

The variables which did not discriminate among idealized indi- 
viduals were sex, amount of training, and the nonlinear component 
of accuracy. Although the present idealized individuals were chosen 
in such a way as to represent three single judges who marked the 
second and third principal components, it would be possible to 
rotate the subject factors in such a way as to maximize the corre- 
lations of the idealized individuals with the various judgmental 
variables. The present positioning of idealized types indicated the 
major judgmental correlates to be validity, predictability and reli-
ability. Although both idealized individual II and idealized
individual III tended not to be valid judges, they differed in that
idealized individual III also represented the least reliable judges.
It is possible that idealized judge II simply weighted the MMPI
scales incorrectly, whereas the invalidity of idealized judge III
stemmed primarily from his unreliability.

Some light is shed on the issue of scale weighting by examination 
of Table 3. For each judge a multiple regression equation between 
the eleven MMPI scales and the 861 profile judgments yielded
a set of 11 regression weights. These weights reflect the manner
in which each judge used the 11 MMPI scales in arriving at his
prediction of the criterion. These standardized regression weights
for each scale were correlated with the judges’ projections on the
idealized individuals. The resulting significant (p < .05) corre-
lations between regression weights and judges’ loadings on the
idealized individuals are presented in the left side of Table 3. 
These correlations indicate the relative importance or the varia- 
bility a given MMPI scale has for a given idealized individual. 


TABLE 3

A Comparison of Idealized Individuals and Marker Judges on Regression Weights
Attached to the Eleven MMPI Clinical Scales

             Significant Correlations      Standardized Regression Weights
Clinical       Idealized Individual                     Marker Judges
Scales          I      II     III        Criterion     I      II     III
L 12 03  .0  .02
F 42 ai QU 19a, 2000
K ‘06 00 i —@
Hs —.39 .68 —.47 —.04 met =й
Pd 06 0 ш -.
Inspection of Table 3 indicates that idealized individual I rep-
resented judges who were positively correlated with the regression
weights for Sc and Pa and negatively correlated with Pt, Hy, D,
and Hs. For purposes of comparison, the middle section of Table 3
presents the standardized regression coefficients of the eleven scales
in predicting the actual criterion. Goldberg (1965) found that
the best predictor of the actual criterion was a simple combination
of five MMPI scales: L + Pa + Sc - Hy - Pt. Idealized
individual I was correlated with four out of the five most valid
scales in the appropriate direction of criterion prediction. In ad-
dition, it can be seen that idealized individual I was negatively
correlated with Hs and D, scales of practically no validity. With
the exception of these latter two scales which were overweighted,
the most valid idealized individual tended to correlate with the
most valid scales.
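
As an illustration only, the five-scale combination attributed above to Goldberg (1965) can be written as a unit-weighted sum. The sketch below assumes the scale scores arrive as a dictionary keyed by scale abbreviation; the function name and the example profile are hypothetical and are not taken from the study.

```python
def goldberg_index(scores):
    """Unit-weighted composite L + Pa + Sc - Hy - Pt.

    `scores` maps MMPI scale abbreviations to numeric scale scores.
    The input layout is an assumption made for this sketch.
    """
    return scores["L"] + scores["Pa"] + scores["Sc"] - scores["Hy"] - scores["Pt"]


# Hypothetical profile, invented purely to show the arithmetic.
profile = {"L": 50, "Pa": 65, "Sc": 70, "Hy": 55, "Pt": 60}
index = goldberg_index(profile)
print(index)  # 50 + 65 + 70 - 55 - 60 = 70
```

Three scales enter with positive sign and two with negative sign, which is why a judge who weights Pt or Hy positively is working against the direction of the criterion.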

Idealized individual II, representing invalid judges, was positively
correlated with the regression weights for Pt, a scale of relatively 
large negative validity. Hence a potentially valid scale was weighted 
in the improper direction. In addition, idealized individual II was 
significantly correlated with D, Hs, and Р, three scales with prac- 
tically no validity. With the exception of Pt which was correlated 
with idealized individual II in the reverse direction to its validity, 
idealized individual II was not related to any of the scales entering 
into Goldberg's best set of five scales. Idealized individual III, rep- 
resenting the less reliable judges, was significantly correlated with 
Pt, Pa, and Hy in the reverse direction to these scales’ validities. 
This idealized judge also overweighted Hs. Although both ideal-
ized individuals II and III had a tendency to be invalid, their in-
validity would appear to stem from variability in the weighting of
quite different MMPI scales. 

It is important to note that the correlations between idealized 
individuals and scale regression weights are not a direct test of the 
scales utilized or weighted in the actual judgment task. For example,
if a scale regression weight does not correlate significantly with an
idealized individual this could be due to two reasons: (a) the
marker judges do not weight the scale; or (b) all of the judges
(marker and nonmarker) weighted the scale, and hence the regres-
sion weight for the scale lacked sufficient variability for a given
idealized individual. On the other hand, even if a scale does cor-
relate significantly with an idealized individual, this correlation alone
would not be an indication of the actual magnitude of the regression 
weight attached to the scale by the marker judges; such a correla- 
tion would only indicate the relative variability of judges’ projec- 
tions on an idealized individual with respect to the regression 
weights for that scale. 

In light of the foregoing considerations, the regression model 
for the single judge marking each idealized individual was examined. 
These were the actual judges through which the rotated factors
passed, and as such these judges would be considered an ade-
quate representation of the corresponding idealized individual. By
examination of these single judges it is possible to determine
directly the judge's weighting system for the eleven MMPI scales.
The right hand side of Table 3 presents the standardized regression
weights for each marker judge for the eleven MMPI scales re-
gressed onto the 861 MMPI judgments. These weights indicate the
relative emphasis a given judge placed on each scale in making his
judgments of psychosis and neurosis for the profiles. The scale
regression weights for the actual criterion provide a basis for deter-
mining the adequacy of each judge's model.

Marker judge I, who was among the most valid judges, correctly
weighted four out of the five most valid scales (Goldberg, 1965).
In addition, he tended to slightly overweight D and Hs. Marker
judge II, the least valid but moderately reliable judge, correctly
weighted Sc and Pa. However, this judge tended to underweight
Pt, Hy, and L, while overweighting F, K, D, and Pd. Although
Marker judge II correctly weighted two out of the five best scale
predictors, his invalidity presumably stemmed from his inappropri-
ate weighting of seven scales. Marker judge III, an invalid and un-
reliable judge, correctly weighted Sc, although he weighted Pt, a
negatively valid scale, with a fairly large positive weight. He tended
to overweight Hs and underweight L, Hy, and Pa. The most ob-


... with respect to
the corresponding idealized individual correlat... Marker
... the scale regression weights. Both idealized individual I ...
is, all of the judges’ projections ... significantly correlated with ...

did not hold, however, ... For
example, both Marker judges II and III placed heavy emphasis on
Sc. However, Sc did not correlate significantly with either idealized
individual II or III. Similarly, some scales which did correlate
significantly with the idealized individuals II and III did not yield
large regression weights for the corresponding marker judges. Pos-
sible noncomparability between marker judge and idealized indi-
vidual can arise from the fact that idealized individual correlations
can be attributed to two confounded sources: (a) the weighting
of the scale, and (b) the variability of the scale. As noted previously,
the idealized individual correlations simply indicate the relative
variability of scale weighting for all of the judges' projections on a
given idealized individual; the models for the marker judges yield
the actual magnitude of the scale weights for a single judge mark-
ing an idealized individual.
Discussion 

In comparing the present data with those of Horn and Stewart 
(1968), recall that two of Horn's three subject factors were totally 
uninterpretable, and Horn's varimax subject factors did not pass 
through real individuals in the factor space of subjects. The present 
data point out the usefulness of identifying idealized individuals on 
the basis of judgmental and personological variables, as well as real- 
istically representing their judgment strategies by considering the 
models for the marker judges. It could even be argued that rather 
than treating each of the 29 judges separately in examining their 
judgmental strategies, only three idealized types need be studied. 
Thus, given a particular constellation of judgmental and personolog- 
ical correlates, it would be possible to predict an idealized judge's 
model for a set of new data, provided that the idealized judge ex-
hibited a similar pattern of correlations with the judgmental and
personological measures.

This is not to say, however, that idealized individuals would 
necessarily emerge in a new study with judgment models identical
to those of the present study. This should only occur when, in fact,
the pattern of outside correlates is identical to those found in
the present study, an unlikely finding in a new study. However,
it can reasonably be predicted that new data should exhibit indi- 
vidual differences in judgmental viewpoints for judgments of psy- 
chosis vs. neurosis from the MMPI. Further, it would be predicted 
that three major judgmental variables would distinguish the 
different perceptual viewpoints: validity, predictability and relia- 
bility. Should different types of invalid judges emerge, the weights 
they place on the MMPI scales as well as their relative reliability 
should also distinguish among them. If a rotation of the subject
factors could be found which replicated the present pattern of 
judgmental and personological correlates, the judgment models 
for the idealized types should also replicate. These hypotheses 
are currently being tested on data similar to those used in the 
present study. Since the present data consisted of a fairly homo- 
geneous group of judges (Minnesota-trained clinicians), a more 
heterogeneous sample might well lead to more than three idealized 
judges. 

Although most of the variables used in the present study were
primarily judgmental variables based on the same data which were 
factor analyzed, the present technique suggests a means whereby 
types of clinical judges might be identified a priori to performing 
the judgment task. It is suggested that future research be directed 
to uncovering a variety of personological and intellective correlates 
of the idealized judges. If constellations of characteristics associ-
ated with different types of clinical judges were isolated, the 
possibility exists for a priori typological identification for any 
given judge. 
A SPECIAL REVIEW OF BUROS' PERSONALITY
TESTS AND REVIEWS 
Social control has never been all that popular: The Saxon gentry
dubbed William’s book Domesday (Doomsday) because it posed a 
record from which there was no appeal. In some quarters Buros’ 
efforts have occasioned something like the same dismay. In the 
Second Mental Measurements Yearbook (MMY) Buros re- 
printed a selection of letters that bitterly criticized the first MMY 
and he reported that some test publishers were reluctant to forward 
samples for review. By now the MMY series is so prestigious that
Buros has little difficulty in securing the cooperation of publishers.
But active resentment may merely have mellowed into benign neg-
lect, for Buros’ whole undertaking seems never to have received
foundation support and his own labors may never have been 
properly acknowledged by the educational and psychological com- 
munities. 

One way to acknowledge Buros’ contribution is to buy his book. 
Since it is expensive, there will be a temptation to ask one's local 
librarian to buy the book instead, in case anyone wants to read 
up on a particular instrument. PTR is unquestionably useful for 
this purpose. Teachers and research workers in the area of per-
sonality studies may find, however, that it has other uses that will
amply repay the cost of ownership. Buros’ pages present an im- 
plicit history of personality assessment that is richer and more 
suggestive than any other the writer has come upon. This review 
will attempt to make parts of this implicit history explicit in order 
to assess the future prospects of such assessment. For this purpose 
it is necessary to structure the data in Buros' book. The reader is
warned that a closer acquaintance with the work might suggest 
different structuring principles to him and that these could lead to 
different and possibly less pessimistic conclusions. 


Exponential Growth 


In surveying this book it is helpful to ask questions. The writer 
wanted to know whether (and in what sense) there has been progress 
in the art of personality measurement in the last 25 or 30 years.
As a first step it seemed desirable to look at the rates at which copy-
righted measuring devices and new publications on these have ac-
cumulated over the years. Since the PTR Test Index gives the date at
which each instrument was copyrighted it was possible to make 
cumulative frequency distributions over years. This was done 


FRED DAMARIN 217 


separately for projective and nonprojective instruments. Table 1 dis-
plays the total number of each sort of device copyrighted up to
1925, up to 1930, and up to the end of each subsequent five year
interval through 1965.³ Figures for 1966-1968 are given, but these
may be incomplete and the figures for 1970 are projections from
trends established up through 1965. These data appear in columns
2 and 3 of Table 1. Columns 4 and 5 record the cumulative number
of references available for the nonprojective and projective mea- 
sures separately beginning in 1940. These figures were calculated 
from a frequency table that appears on page xxiii of Buros’ intro- 
duction to PTR. The last two columns in our Table 1 display 
ratios of references to instruments (separately in the nonprojec- 
tive and projective areas) for each data point since 1940. 

The growth trends for instruments have been graphed in Figure 
1, the trends for references appear in Figure 2, and the trends for
references-per-instrument appear in Figure 3. All of these graphs
are semi-logarithmic so that the scale of years on the X axis is
linear while the scale of cumulative instruments, cumulative refer- 


TABLE 1

Cumulative Instruments Copyrighted, Cumulative References, and Ratios
of References to Instruments

           Tests Copyrighted        References         References/Test
Year       Nonproj.    Proj.    Nonproj.     Proj.    Nonproj.    Proj.
(1)          (2)        (3)       (4)         (5)       (6)        (7)
(1970)       510         95    (13,000)    (8,500)    (25.5)     (93.6)
1968         404         93     10,947      7,753      27.0       83.4
1965         369         89      8,365      6,786      22.7       76.2
1960         301         78      4,942      5,191      16.4       66.6
1955         223         58      3,119      3,609      14.0       62.2
1950         179         37      2,031      1,724      11.3       46.6
1945         139         16      1,214        653       8.7       40.8
1940         100         10        714        225       7.3       22.5
1935          44          4         -          -         -          -
1930           9          4         -          -         -          -
1925           2          2         -          -         -          -

Note. Figures for 1966-1968 may be incomplete. Data given for nonprojective
(nonproj.) and projective (proj.) instruments at intervals since 1925.
Parenthesized values are extrapolations from recent growth rates. Dashes
indicate missing data. All information taken from Buros (1970).


³ ... the total number of instruments available at any given time
need not always agree exactly with comparable figures in PTR. Buros
seems to count scoring services as separate instruments but this
review does not.
Figure 1. Growth curves for personality instrument copyrights. The log-
arithm of the total number of copyrights issued is plotted against the date
at five year intervals from 1925 to 1965. The solid line represents non-
projective devices while the dashed line represents projective measures. The
points representing 1970 are crude extrapolations. The data are from
Buros (1970).

ences, and ratios on the three Y axes are logarithmic. Markedly
linear trends, with some inflection points, are visible in all three
graphs. Over long periods of time the number of instruments, the
number of references, and the number of references per instrument
tend to grow exponentially: They double every few years. Over
shorter periods of time there are often perturbations that are not
displayed in Figures 1, 2, or 3. During and after the Second World
War, for example, the production of both instruments and re-
search articles declined and then rose very sharply as though a
deficit were being made up. Growth then resumed as if nothing
had happened.

Figure 2. Growth curves for personality instrument references. The log-
arithm of the total number of references available is plotted against the
date at five year intervals from 1940 to 1965. The solid line represents refer-
ences to nonprojective devices while the dashed line represents references to
projective measures. The points representing 1970 are crude extrapolations.
The data are from Buros (1970).

Figure 3. Growth curves for ratios of references to instruments. The log-
arithm of the ratio of the number of references available to the number
of instruments copyrighted is plotted at the end of each five year interval
between 1940 and 1965. The solid line represents nonprojective devices while
the dashed line represents projective measures. The points for 1970 are crude
extrapolations. The data are from Buros (1970).

Focusing on the long term trends in these data, we note that
the most rapid growth in new copyrights occurred in the nonpro-
jective area between 1925 and 1935. The total supply of these in-
struments then doubled about every 2.25 years! Between 1935 and
1940 this rate of copyrighting declined and after 1940 the rate
declined again to a relatively stable value that produces a doubling
in the total supply of such instruments every 12.5 years. This rate
was maintained from 1940 to 1965 and if it continues through 1970 
something like 510 nonprojective devices will have been copyrighted 
by the end of this year. The supply of copyrighted projective in- 
struments grew at a fairly stable rate from the early 1930's to 1955 
or possibly 1960, doubling every five to six years. The rate of growth 
then began to decline considerably. If the trend established between 
1960 and 1965 is maintained, there will still be no more than 95 
projective measures copyrighted by the end of that year. 
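
The doubling periods quoted above can be checked roughly from pairs of cumulative counts. The sketch below assumes pure exponential growth between two dates, which is a simplification of reading a slope off a fitted line on semi-logarithmic paper; the function name is introduced here for illustration, and the counts are the cumulative copyright totals from Table 1.

```python
import math


def doubling_time(n_start, n_end, years):
    """Years needed to double, assuming exponential growth from n_start to n_end."""
    return years * math.log(2) / math.log(n_end / n_start)


# Nonprojective copyrights, 1925 to 1935: 2 -> 44 instruments.
early_nonproj = doubling_time(2, 44, 10)
# Nonprojective copyrights, 1940 to 1965: 100 -> 369 instruments.
later_nonproj = doubling_time(100, 369, 25)
# Projective copyrights, 1935 to 1955: 4 -> 58 instruments.
projective = doubling_time(4, 58, 20)

print(round(early_nonproj, 2))  # 2.24
print(round(later_nonproj, 1))  # 13.3
print(round(projective, 1))     # 5.2
```

The two-point estimate for 1940 to 1965 comes out slightly above the 12.5 years given in the text, which is the kind of difference one would expect between an endpoint calculation and a trend fitted by eye.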

The growth of research papers on nonprojective and projective 
instruments from 1940 through 1965 is displayed in Figure 2. 
Throughout this period references to nonprojective measures grew 
exponentially at a quite rapid rate, doubling in number every 
seven years. References in the projective area accumulated at
twice that rate prior to 1955; but since then their growth rate has 
declined to a doubling period of about 11.5 years. If the current 
rates of growth are maintained there should be some 13,000 refer- 
ences to nonprojective devices and some 8,500 references to projec- 
tive devices by the end of that year. 

While the number of copyrights and references both increase, 
the references grow at the faster rate and the ratio of references 
to instruments, therefore, tends to increase. The number of refer- 
ences per nonprojective measure seems to have doubled every 15 
years and may reach an average of 25.5 by the end of 1970. The 
number of references per projective measure shows considerably 
more year to year variation and may have been in decline since 
1955. Our estimate, however, is still an average of 93.6 papers per 


instrument by the end of 1970.
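
The doubling period of the ratio follows from the two doubling periods already given, since the growth rate of a ratio of two exponentials is the difference of their rates. The sketch below is a check under that assumption; the function name is hypothetical, and the inputs are the seven-year and 12.5-year figures quoted above for nonprojective references and instruments.

```python
def ratio_doubling_time(t_numerator, t_denominator):
    """Doubling period of a ratio whose numerator and denominator both grow exponentially.

    Rates are expressed in doublings per year, so the ratio's rate is
    1/t_numerator - 1/t_denominator.
    """
    rate = 1.0 / t_numerator - 1.0 / t_denominator
    if rate <= 0:
        raise ValueError("the ratio does not grow unless the numerator doubles faster")
    return 1.0 / rate


# References double every 7 years, instruments every 12.5 years.
print(round(ratio_doubling_time(7, 12.5), 1))  # 15.9
```

The result, a little under 16 years, is in line with the statement that references per nonprojective measure have doubled about every 15 years.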

Growth curves such as these are familiar to historians of 
science and technology (Price, 1961, 1962). They commonly fall 
into four phases: an initial period of growth spurts and false starts, 
a period of steady exponential growth, a decline into merely linear
growth, and finally either stagnation or renewal if new techniques
or points of view revive the field and bring about a new period
of exponential growth. Nonprojective personality instruments are
still at the second of these four stages, for all the relevant growth
rates are still exponential. The projective measures seem well em-
barked on the third leg of this four stage journey, for the rate of
copyrighting and researching them has declined so steadily that
there are now less than half the number of new instruments and
publications that the earlier growth rates (prior to 1955) would
have led one to expect. Unless new techniques or insights appear that
can revive this field, the whole projective movement may be mori-
bund by 1980. The seriousness of this prospect becomes apparent 
when we realize that the label “projective test” covers a great 
variety of devices while the nonprojective instruments consist 
chiefly of questionnaires. What will happen to the field of per- 
sonality studies if, after a half century of work, the self-report 
questionnaire emerges as the only successful personality measuring 
device? 


Quantity and Quality 

Since concern about these quantitative issues would abate if the 
quality of personality measurement were clearly improving, it is 
worth examining a number of mechanisms by which increments in 
quality might be produced. Progress is sometimes said to stem 
from competition and the survival of the fittest, from research, and 
from the application of higher professional standards. The data in 
PTR may be used to evaluate each of these hypotheses. 


Competition and Progress 


The continuous arrival of new personality measures and re-
search studies leads us to suppose that the older personality mea-
sures are just as continually falling by the wayside, obsolescing con- 
siderable portions of the early literature as they go. This view gains 
credence from studies of the changing popularity of personality in- 
struments in clinics and business organizations. Sundberg (1961),
for example, collected usage data in several sorts of clinical services 
in 1959 for comparison with similar data from Louttit and Browne 


+ Nonprojective devices need not consist chiefly of questionnaires. As Cat-
tell and Warburton (1967) demonstrate, there are ... as many types
of tests in Cattell's Objective Analytic Test Battery as there are in all of
PTR. But this nonprojective battery was copyrighted in 1955, it has ac-
cumulated only 23 references, and it is now out of print. We understand that
this state of affairs is temporary and hope to see further work done on these
measures, for the data of this review highlight their strategic importance.

(1947) covering 1946 and 1935. Sundberg found that between 1935
and 1946 there was a turnover rate of 60 per cent in the twenty 
most favored measures. Between 1946 and 1959 this rate was 38 
per cent. While these changes are dramatic, the figures confound 
changes in the perceived suitability of instruments for particular 
diagnostic goals—such as assessing brain damage—with changes 
in the diagnostic goals themselves. As Sundberg so clearly points 
out, changes of this latter sort prevailed throughout clinical 
psychology between 1935 and 1959. The rapid turnover of instru- 
ment preferences during this era is, therefore, no guarantee that 
direct competition between devices occurred or that progress in 
instrument design was being made. 

The tendency for instruments to go out of print might be a more 
nearly optimal index of competition and hence of technological 
progress in testing. We would expect, for example, that most of the
recently copyrighted measures would still be in print but that those 
that were copyrighted at progressively earlier dates would have suf- 
fered progressively more attrition, so that by now nearly all the 
earlier devices would have disappeared from the market. The Test
Index in PTR lists in- and out-of-print instruments in separate 
sections and it was easy to use this information to construct Table 
2. Column 1 of this table subdivides over 40 years of measurement 


TABLE 2

Personality Instruments Copyrighted and Instruments Out-of-Print in 1970

Test Copyrighting                          Out of Print         Proportion
Interval              Copyrighted            in 1970         Now Out of Print
of Years           Nonproj.    Proj.    Nonproj.    Proj.    Nonproj.    Proj.
(1)                  (2)        (3)       (4)        (5)       (6)        (7)
1966-68               35          4         1          0       .03        .00
1961-65               68         11         1          0       .02        .00
1956-60               78         20        10          1       .13        .05
1951-55               44         21         8          4       .18        .19
1946-50               40         21        11          4       .28        .19
1941-45               39          6        17          2       .44        .33
1936-40               56          6        43          2       .77        .33
1931-35               35          0        25          0       .71        .00
1926-30                7          2         4          0       .57        .00
Up to 1925             2          2         1          0       .50        .00
Totals               404         93       121         13       .30        .14

Note. Figures for 1966-1968 may be incomplete. By consecutive half decade
intervals. Data abstracted from Buros (1970).
history into blocks of (usually) five years each. Columns 2 and 3
record the incidence of nonprojective and projective instrument 
copyrights within each time block. Columns 4 and 5 report the 
number of these devices that are listed as out of print as of the 
publication of PTR. The final two columns give the ratios of out- 
of-print to total instruments within each block of time. 
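
The out-of-print ratios in the last two columns are simple quotients, and recomputing them is a useful check on the table. The sketch below uses the counts of Table 2 as read here; two cells (78 nonprojective copyrights for 1956-60 and 4 out-of-print projective devices for 1946-50) are taken as the values consistent with the printed proportions and column totals, which is an assumption about the printed table. The names `TABLE_2`, `proportion`, and `totals` are introduced only for this illustration.

```python
# (interval, nonproj. copyrighted, proj. copyrighted,
#  nonproj. out of print, proj. out of print)
TABLE_2 = [
    ("1966-68", 35, 4, 1, 0),
    ("1961-65", 68, 11, 1, 0),
    ("1956-60", 78, 20, 10, 1),
    ("1951-55", 44, 21, 8, 4),
    ("1946-50", 40, 21, 11, 4),
    ("1941-45", 39, 6, 17, 2),
    ("1936-40", 56, 6, 43, 2),
    ("1931-35", 35, 0, 25, 0),
    ("1926-30", 7, 2, 4, 0),
    ("Up to 1925", 2, 2, 1, 0),
]


def proportion(out_of_print, copyrighted):
    """Out-of-print ratio, or None when nothing was copyrighted in the block."""
    return out_of_print / copyrighted if copyrighted else None


def totals(rows):
    """Column sums over all time blocks, in the order of columns 2 to 5."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


for interval, n_np, n_p, out_np, out_p in TABLE_2:
    print(interval, proportion(out_np, n_np), proportion(out_p, n_p))

n_np, n_p, out_np, out_p = totals(TABLE_2)
print(round(out_np / n_np, 2), round(out_p / n_p, 2))  # 0.3 0.14
```

A block with no copyrights, such as projective devices in 1931-35, has no defined ratio, so the sketch returns None there rather than a zero.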

These out-of-print ratios do behave in certain respects as though 
they measured obsolescence: They increase as one goes backward 
in time, at least until the middle 1930s. The index for projective 
measures is uniformly lower than for nonprojective measures, sug- 
gesting that the latter are more likely to go out of print. Since the 
rate of copyrighting nonprojective measures is still exponential, the 
continual arrival of new devices may indeed put the older ones 
out of business. 

As one looks back beyond the middle 1930s, however, the out- 
of-print ratios in the last two columns stop rising and begin to 
decline. Some very early instruments seem to have a competitive 
advantage over later ones, but this is not what conventional no- 
tions of obsolescence would lead one to expect. Further difficul-
ties arise when one examines the references attached to the in-
and out-of-print devices; the out-of-print versions seem to have
less than their fair share. The 30 per cent of all nonprojective mea-
sures that have gone out of print account for only 7 per cent of
the literature on nonprojective instruments (774 references out of
10,947). The 14 per cent of all projective measures that are now out
of print account for a little over 1 per cent of their literature (90
out of 7,753 references). Looking at these data in another way, we
find that while a few inventories such as Thurstone’s Personal-
ity Schedule or the Bogardus Social Distance Scales went out of
print with over 50 references, the majority departed with far fewer.
The nonprojective measures that are still in print average 36 refer-
ences apiece whereas those that are out of print average only 6.4
references and have a median of two! The projective devices in
print average 95.8 references apiece but out-of-print projective
measures average 6.9 references and have a median of 5.
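
The averages just cited follow from the reference counts given in this paragraph and the instrument totals of Table 2. The sketch below recomputes them; the function name is hypothetical, and the medians cannot be recovered this way because they depend on the per-instrument counts, which are not reproduced here.

```python
def mean_references(total_refs, out_refs, total_tests, out_tests):
    """Mean references per instrument for in-print and out-of-print devices."""
    in_print = (total_refs - out_refs) / (total_tests - out_tests)
    out_of_print = out_refs / out_tests
    return in_print, out_of_print


# Nonprojective: 10,947 references in all, 774 on the 121 of 404 devices out of print.
nonproj_in, nonproj_out = mean_references(10947, 774, 404, 121)
# Projective: 7,753 references in all, 90 on the 13 of 93 devices out of print.
proj_in, proj_out = mean_references(7753, 90, 93, 13)

print(round(nonproj_in, 1), round(nonproj_out, 1))  # 35.9 6.4
print(round(proj_in, 1), round(proj_out, 1))        # 95.8 6.9
```

The same figures give the shares of literature mentioned earlier: 774 of 10,947 is about 7 per cent, and 90 of 7,753 is a little over 1 per cent.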

If competition implied progress, personality measures ought to
go out of print because they are dated by technically more advanced 
successors. An undetermined but probably large number of personal- 
ity measures may go out of print because they cannot compete with 


well entrenched predecessors. This may occur because good ideas
are widely copied and because the competition is between instru- 
ments that are both original and sound and a host of inferior imi- 
tations that are gradually forced off the market. On the other hand, 
the demand for innovation in personality measurement may be so 
inelastic that modest improvements have little chance of accep- 
tance. In neither case does competition imply automatic progress 
and there is, therefore, room to suspect that the field of personality 
measurement may actually be in a state of technological stagnation. 


Research and Progress 


The sheer volume of research that is being done with personality 
instruments might seem to guarantee progress in design, at least 
until we glance at some data in a table that Buros provides on 
page xviii of his introduction to PTR. This table shows that the 
first six MMY’s recorded a total of 2,802 tests of all sorts which 
fell into 15 major categories. The category of personality measures 
contained 386 members or about 14 per cent of the total. The en- 
tire collection had a bibliography of 23,763 titles to which the 
category of personality measurement contributed 11,214, or al- 
most 50 per cent of the total. The next largest contributors were 
two measurement categories that are often absorbed into expanded 
definitions of personality. There were 290 intelligence tests with 
5,494 references and 365 vocational measures with 2,650 refer- 
ences. In marked contrast the tests in such academic areas as 
Business Education, English, Foreign Languages, Mathematics, 
Science, and Social Studies constituted a pool of 1058 instruments 
with 1194 references, which is only a little more than one refer- 
ence per test. Almost everyone would agree that academic tests 
are more valid as a group than the personality measures and yet 
research publications favor the latter by an enormous margin. There 
is, evidently, a great deal of difference between research on tests 
and research with tests and most of the references in the personality 
area probably fall into the latter category. 
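
The contrast drawn in this paragraph is easiest to see as references per test. The sketch below simply divides the counts quoted from Buros's table; the category labels are shorthand chosen here for illustration.

```python
# (number of tests, number of references) as quoted in the text.
categories = {
    "personality": (386, 11_214),
    "intelligence": (290, 5_494),
    "vocational": (365, 2_650),
    "academic": (1_058, 1_194),
}

refs_per_test = {
    name: refs / tests for name, (tests, refs) in categories.items()
}

# Personality measures as a share of all tests and of all references.
personality_test_share = 100 * 386 / 2_802
personality_ref_share = 100 * 11_214 / 23_763

for name, rate in refs_per_test.items():
    print(f"{name:12s} {rate:6.2f} references per test")
print(f"personality: {personality_test_share:.0f}% of tests, "
      f"{personality_ref_share:.0f}% of references")
```

On these figures the personality measures average roughly 29 references apiece, against a little more than one for the academic tests.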

Personality researchers may view their instruments as tools that 
are important for the kind of service they render rather than for the 
quality of that service. Zipf (1949) suggests that the most used members
of any set of tools will tend to be versatile. There are
at least two senses in which tools can be versatile, however.
They can perform many different functions, like a jackknife, or they
can perform one service that is involved in many different kinds
of work, like a hammer. Since the number of scores provided by an
instrument might index its jackknife-type versatility, we asked
whether the more multivariate instruments tended to accumulate
references more rapidly.

In testing this hypothesis we turned to a table provided by
Buros on page xxiv of his introduction to PTR. This table gives 
the number of references for each of those 89 personality measures 
that have more than 25 references apiece. Some of the top measures
on this list could have accumulated many references merely
because they have been in circulation since the 1930s. In order to
correct for this the out-of-print devices were dropped and the
references for each in-print instrument were divided by the number
of years that elapsed between the year of its first copyright and
1970. These in-print instruments were then listed in descending
order of their publication rates in Table 3. Information on the
number of scores provided by each measure was then sought in 
the PTR Test Index. The data on projective measures proved 
incomplete; in many cases the number of scores was not even listed,
perhaps because they are thought capable of measuring an
indefinitely large number of traits. In any event it was possible to
test the versatility hypothesis only on non-projective instruments. 
The 62 in-print, nonprojective measures were dichotomized at 
the median on the “references” variable to give 31 devices with 
more than 3.20 references per year and 31 with less. The “number
of scores” variable was dichotomized as near its median as possible
to give 35 devices with six or more scores and 27 with five scores
or fewer. Twenty-two of the more frequently published measures
had six or more scores. Chi square for this fourfold table is 5.30
and with a two-tailed test the associated probability is about
.025. The corresponding contingency coefficient is .29, the phi
coefficient is .25, and the tetrachoric correlation was estimated
at .45. There appears to be a modest relationship between multivariate
versatility and popularity as a research instrument, at
least among that group of instruments that have accumulated 26
publications or more.
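
The fourfold table can be rebuilt from the marginals given above: 22 of the 31 high-rate devices have six or more scores, so the cells are 22, 9, 13 and 18. The sketch below recomputes the statistics under two assumptions that the text does not state: that the chi square is the uncorrected Pearson value, and that the tetrachoric correlation was estimated with the cosine-pi approximation. The function name is illustrative.

```python
import math

def fourfold_stats(a, b, c, d):
    """Statistics for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    cross = a * d - b * c
    chi2 = n * cross ** 2 / margins
    phi = cross / math.sqrt(margins)
    contingency = math.sqrt(chi2 / (chi2 + n))
    # Two-tailed probability for chi square with one degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    # Cosine-pi approximation to the tetrachoric correlation (assumed method).
    r_tet = math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))
    return {"chi2": chi2, "phi": phi, "C": contingency, "p": p, "r_tet": r_tet}

# Rows: high vs. low publication rate; columns: six or more vs. five or fewer scores.
stats = fourfold_stats(22, 9, 13, 18)
for name, value in stats.items():
    print(f"{name:6s} {value:.3f}")
```

Under these assumptions the chi square comes out near 5.3 and the tetrachoric estimate near .45, in line with the text. The recomputed phi and contingency coefficients may differ somewhat from the printed values, which could reflect rounding or a different formula.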

There are measures in Table 3 that are being vigorously researched
even though they provide relatively few scores. Several
of these appear to assess the "first factor" of the MMPI, the one
that Edwards (1970) calls social desirability. This dimension 
probably displays our second sort of versatility, for it is involved
in many research problems especially when questionnaires are 
used to identify psychopathology. Almost all the symptoms of 
various sorts of psychopathology are socially undesirable and will 
be confessed only by those who are capable of making negative 
statements about themselves. Questionnaire measures of psycho- 
pathology, therefore, tend to produce a common factor even 
though the syndromes being measured may have very little in 
common. 

There is, of course, no necessary relationship between versatility 
and validity. One response dimension should not be used to index 
six or seven conceptually different syndromes and multivariate 
instruments that utilize a single measurement method (typically 
the questionnaire) probably provide less validity per trait than 
would an equally comprehensive battery of univariate scales in 
which each trait is measured by a method that is maximally 
appropriate to it.


Standards and Progress 


Personality testers, particularly projective testers, are often 
urged to study professional test standards and apply them to 
their own wares. Buros reprints the current APA-AERA-NCME 
Standards for Educational and Psychological Tests and Manuals 
in PTR, but he also calls our attention to certain insufficiently 
appreciated peculiarities in this document. It focuses on reporting
in the manual, not on instrument design. The Standards (and the
earlier Technical Recommendations, 1954) assert that projective
instruments pose special problems because they have both nomothetic
and idiographic aspects and they urge that their authors
report the nomothetic aspects fully in their manuals. Many authors
may have accepted this advice as an invitation to force data from
their instruments into something like a manual. The result, to
judge from Buros’ reviewers, is often a laboriously intricate scoring
system with statistically undesirable properties including low
reliability. Other authors may try to capitalize on the supposedly
idiographic aspects of their instruments by trying to show that
experts can use the material to make interesting predictions. The
results are usually one more demonstration of the fallibility of
experts, especially when they are competing with a regression
equation.

TABLE 3
Personality Instruments with 25 or More References

                                                   References
Tests (and rankings)                                 per year      Scores

Minnesota Multiphasic Personality Inventory (1)         88.4
Rorschach (1*)                                          76.5
Edwards Personal Preference Schedule (2)                41.0
Thematic Apperception Test (2*)                         36.4
California Psychological Inventory (3)                  28.4
Maudsley Personality Inventory (4)                      24.5
Sixteen Personality Factor Questionnaire (5)            18.0
The Guilford-Zimmerman Temperament Survey (6)           14.5
Bender-Gestalt Test (3*)                                13.4
Study of Values (7)                                     12.2
Rosenzweig Picture-Frustration Study (4*)               11.9        15
Machover Draw-A-Person Test (5*)                        10.5
Personality and Personal Illness Questionnaires (8)      9.7        16
The Holtzman Ink Blot Technique (6*)                     9.3        22
The Personality Inventory (Bernreuter) (9)               8.7
California Test of Personality (10)                      7.8        15
Eysenck Personality Inventory (11)                       7.6
Interpersonal Check List (12)                            7.3
Stern Environment Indexes (13)                           6.5
Omnibus Personality Inventory (14)                       6.4
Survey of Interpersonal Values (15)
The Blacky Pictures (7*)
The Adjective Check List (16.5)
H-T-P (8*)
The IPAT Anxiety Scale Questionnaire (16.5)
Multiple Affect Adjective Check List (18)
Szondi Test (9*)
Inpatient Multidimensional Psychiatric Scale (19)
College and University Environment Scales (20.5)
Spiral Aftereffect Test (20.5)
The Adjustment Inventory (Bell) (22)
Mooney Problem Check List (23)
Structured-Objective Rorschach Test (10*)
Embedded Figures Test (24)
Vineland Social Maturity Scale (25)
Rotter Incomplete Sentences Blank (11*)
Gordon Personal Profile (26)
Cornell Medical Index-Health Questionnaire (27)
The FIRO Scales (28)
Stern Activities Index (29)
Personal Orientation Inventory (30)
Shipley-Institute of Living Scale for Measuring
  Intellectual Impairment (31)
Stanford Hypnotic Susceptibility Scale (32)
Goldstein-Scheerer Tests of Abstract and Concrete
  Thinking (33)
The Hoffer-Osmond Diagnostic Test (34)
The Guilford-Martin Inventory of Factors GAMIN (35)
Kahn Test of Symbol Arrangement (12*)
Jr.-Sr. High School Personality Questionnaire (36)
The IES Test (13*)
Minnesota Counseling Inventory (37)
Cornell Index (38)
Vocational Preferences Inventory (39)
Children's Apperception Test (14*)
The Guilford-Martin Personnel Inventory (40)
Lowenfeld Mosaic Test (15*)
Myers-Briggs Type Indicator (41)
STS Youth Inventory (42)
Thurstone Temperament Schedule (43)
An Inventory of Factors STDCR (44)
Make A Picture Story (16*)
Tennessee Self Concept Scale (45)
Welsh Figures Preference Test (46)
The Humm-Wadsworth Temperament Scale (47)
Kent-Rosanoff Free Association Test (17*)
Activity Vector Analysis (48)
Interpersonal Diagnosis of Personality (18*)
Gordon Personal Inventory (49)
Memory-For-Designs Test (50)
It Scale for Children (51)
Concept Formation Test (Vigotsky) (52)
A-S Reaction Study (53)
Babcock Test of Mental Efficiency (54)
KD Proneness Scale and Check List (55)                              6
Social Intelligence Test (56)                                       6
Kuder Preference Record-Personal (57)
The Empathy Test (58.5)                                             9 plus
The Purdue Master Attitude Scales (58.5)
Attitude-Interest Analysis Test (60)
Security-Insecurity Inventory (61)                                  5
Personal Adjustment Inventory (62)

Note. Numbers in parentheses following each title give rankings for
nonprojective and projective measures. Column two contains the average
number of references accumulated per year since first being copyrighted.
Column three contains the number of scores per instrument. All data
taken from Buros (1970).
* Projective instrument.
** Number of scores not given in Personality Test Index in Buros (1970).

Hindsight now suggests that the real problem is that some in- 
struments are composed of items whereas others are not (DuBois, 
1970). When independent, dichotomously scored units can be
grouped in various ways to form scales it is relatively easy to 
upgrade the test manual. One can provide more adequate norms, 
report more extensively on reliability, disclose new correlations
between scales and external criteria. A stricter application of the 
current test standards should produce more of these benefits and 
even lead indirectly to some improvements in the instrument it- 
self. The author might decide to add items to some scales to bring 
their reliabilities up to the competition. 

Many projective measures do not possess items in the con- 
ventional sense. Complex patterns of behavior may be evoked, 
but the components are often experimentally or statistically interdependent
and they may be difficult to group into scales in ways
that satisfy psychometric criteria. Improving an instrument of
this type may require changes in its design. The solution can be as
simple as Holtzman’s conversion of the classical Rorschach into a
format with one response per ink blot. Alternatively, one may 
have to experiment with the device, modifying the instructions 
and the stimuli, until the elicited behavior is well enough under- 
stood to recast the entire procedure into a format with greater 
psychometric utility. 

Since this sort of instrument-oriented research may receive very
little encouragement in any quarter, there is a good chance that 
very little of it is being done. There is no reason to suppose that 
stricter application of the manual-oriented Technical Recom- 
mendations and Standards will improve matters. It may instead 
accelerate the abandonment of a wide variety of formats—notably 
but not exclusively the projective techniques—in the mistaken idea 
that they are unsound when they are merely undeveloped. It may 
lead to the acceptance of questionnaires as psychologically sound 
when they are advanced only from the psychometric viewpoint. 


The Costs of Stagnation 


PTR presents personality measurement in a new and unexpected 
light: There is considerable activity in some quarters but in others 
there is evidence of impending stagnation. Inferior instruments 
are supposedly hazardous to test takers and some serious professional
thought, along with at least one Congressional investigation,
has been devoted to protecting their interests. Teachers and
researchers in the area of personality studies have been less con- 
cerned about the possible costs of stagnation to themselves. The 
researchers probably suppose that personality measures are tools 
that may be used indefinitely without much concern about their 
maintenance and improvement. By using them in substantive research
they may even suppose that they are making improvements.
The exact opposite may actually be true. 

Most psychologists use personality instruments to define constructs.
In order to learn more about anxiety, extroversion, or need
achievement they administer measures of these traits and apply 
statistical procedures to the scores. The results are sometimes 
insignificant and the experimenter must then conclude that he mis- 
understood his construct or that he chose a poor measure of it 
(or both). In a field where type II errors are as frequent as they 
are said to be in personality research (Cohen, 1962) these doubts 
will accumulate over measures and over time. Since it is easier 
to change one’s ideas than one’s tools, subsequent researchers 
may try to design more “insightful” studies with the old instru- 
ments instead of producing better instruments. When thousands 
of problematic studies accumulate, as they have on the most 
popular measures, a few skeptics may try to sift the data to see 
what is wrong, but the information may then be lost in the general 
glut. At some point along the way there may be questions from 
allied social science disciplines as to whether people who talk 
about personality research are really capable of advancing our 
knowledge of human behavior.

While technological stagnation does pose a threat to their scientific
credibility, teachers and researchers in this area are by no
means doomed. Their situation is serious but they possess means
for improving it. They can extend and refine their analysis of
personality constructs in two directions. Given any personality
trait or state they could ask why it might be more accurately
assessed with one sort of measuring device rather than with another.
They might then ask how socially important criteria for their
instruments can be provided out of the data of everyday life.


The Utility of Oscar Buros 


An interest in problems of this sort should change one’s view 
of PTR from a doomsday book to a valuable aid. Buros (1970,
page xxvii) asks his reviewers to go beyond the manual, to compare
tests, to say which is best, to praise good work and to
censure bad. Beyond this his instructions are open ended, even
projective. The reviewers decide for themselves what excellence
is and they attack this issue from many angles. PTR is suggestively
rich, but its full value may be realized only if one has some means
of isolating information without being distracted by irrelevant
material.

Classification schemes can be used to underline design characteristics
that are common to diverse personality measures. Several
such classifications have been suggested (Cattell, 1957; Cattell
and Warburton, 1967; Guilford, 1967; Symonds, 1931). For
illustrative purposes we will sort tests into six families depending
on the nature of the stimulus presented to the subject and the
nature of the response expected from him. This classification may
be applied to most of the instruments in PTR and it will also
accommodate many of the experimental measures in Cattell and
Warburton’s (1967) compendium, Objective Personality and Motivation
Tests. The scheme is illustrated in Table 4.


An Index to Controversy 


“Process tracing” experiments postulate a series of stages or
processes that mediate between stimulus and response (Woodworth
and Schlosberg, 1958). In some of the test families in Table 4
these processes seem open to inspection, for when a subject draws
a picture (Family 1) or constructs a toy world (Family 2) he produces
a series of qualitatively distinctive responses that terminate
in a finished product. The result may count as a single item, but
all of the subject’s behaviors can be referred to the consecutive
stages by which this item-response came into being. While the
devices in Families 3 through 6 usually contain several items, the
behaviors appropriate to any one of these tend to be less overt.
In the case of questionnaires we have, instead of an audible train
of associations, a silent, routinized internal switching that leads
to “yes” or “no.” Hidden processes are likely to be poorly understood
and a potential source of controversy among competing
personality testers. PTR can be read as an index to these controversies
and as a means of identifying some potentially important
substantive research problems that are buried in the measurement
literature.

TABLE 4
Classification of Personality Measures

Stimulus Material      Type of Response         Examples

Projective Measures
1. Blank paper         Drawings                 Figure drawing devices of all sorts:
                                                H-T-P, Eight Card Redrawing Test;
                                                Drawing Changes Under Disapproval
                                                (T-102).
2. Pieces or parts     Construction of a        Lowenfeld Kaleidoblocks, Lowenfeld
                       whole                    Mosaic, Make A Picture Story,
                                                Picture World Test, Toy World Test.
3. Ambiguous           “Imaginative”            Holtzman Ink Blot Technique,
   visual patterns     perception               Howard Ink Blot Test, Rorschach,
                                                structured Dra (T-327),
                                                Autistic Projection (T-369).
4. Pictures            Verbal associations,     Blacky Pictures, Children's
                       semantic units or        Apperception Test, Thematic
                       systems                  Apperception Test, Rosenzweig
                                                Picture-Frustration Study;
                                                Brunswik's Faces (T-92),
                                                Picture Exploration (T-104).
5. Semantic units      Verbal associations      Association Adjustment Inventory.

Nonprojective Tests
6. Semantic units      Semantic units or        Psychiatrist's interview, Interaction
   or systems          systems                  Chronograph, most questionnaires.

Note. Examples are not exhaustive.

The perceptual response to instruments like the Rorschach
(Family 3) occurs so rapidly that there is debate about the relative
importance of perceptual styles (such as color dominance) as
opposed to motivational states that predispose people to see
certain kinds of content (see McCall's and Eron's reviews of the
Rorschach in PTR). Behind this seemingly measurement oriented
controversy there is our very real ignorance of the principles underlying
the apperception of ambiguous stimuli. Where questionnaires
are concerned there has been much controversy about the relative
importance of behavioral traits like anxiety or extroversion as
opposed to evaluative consistencies and response sets in determining
the response to individual items (see Edwards', Frederiksen's,
and Stricker's reviews in PTR). Responses to personality ques-
cesses that operate when people are asked to describe themselves
in a questionnaire. The relative importance of these processes in
different people and on different items is surely an appropriate 
topic for scientific enquiry; very little is really known about the 
determinants of self descriptive behavior. 


An Index to Psychometric Developments 


Wiggins noted that,


“Interest in the free association experiment was so wide-spread
at the turn of the century that a full account would be almost
indistinguishable from a general history of the psychology of
that era”


and yet,


“... no systematic, large scale efforts have been made to
develop the instrument as a ‘personality test’ in the current
usage of these words since its inception in 1910.”


Research on the scientific problems underlying test-taking behavior
is necessary and desirable, but it is not sufficient to
guarantee improved measures. It is also necessary to produce
an instrument with desirable psychometric properties. PTR may
be used to learn what psychometric problems exist in various areas 
and what progress is being made in resolving them. 

Apropos of Wiggins’ comment, we note that the Kent-Rosanoff
might be considered the first and most primitive member of a
family of association instruments (Family 5). Tendler (1930)
devised one of the first sentence completion devices in order to
improve the association experiment by controlling and systematically
varying the anticipatory aspect of the associative reaction
and thus aim the test at specific topics and themes (Rapaport,
Gill, and Schafer, 1968). Other test authors with similar purposes
have produced association tests with multiple choice formats such
as the Association Adjustment Inventory and Cattell’s T-
Since these formats are more suited to personality testing than
the original unfocused Kent-Rosanoff they may have received
the lion's share of the development and validation, even though
they are less used in purely academic research on associative 
behavior. Analogous developments seem to occur in other test 
families. The primitive ancestor of the measures in Family 4 is 
probably the TAT. While this test has been much used in research 
on projective processes, it is not conceded to be valid as a personality
measure (see Eron's and Jensen’s reviews in PTR). Reviewers
have been uniformly kinder to instruments that are
focused on a restricted range of situations or thema such as the 
Blacky Test, the Rosenzweig PF and the Tomkins-Horn Picture 
Arrangement Test. 

In Family 3 the much researched Rorschach is psychometrically 
inferior to later instruments such as the Holtzman Ink Blot Test, 
The Structured Objective Rorschach, and several Cattell measures 
such as T-327 and T-369, but the later tests have been developed 
along two quite different lines. Some stress perceptual styles as 
determinants of the response while others stress content. A similar
situation arises in Family 6 where the primitive ancestor is undoubtedly
the psychiatric interview. One line of development from
the interview led to the Interaction Chronograph which emphasized
the timing of the subject’s responses to the interviewer, while 
neglecting response content. This most interesting device is now, 
unfortunately, out of print. An alternative and more successful 
procedure retained the content of the interviewer's questions while 
eliminating the interviewer. This tactic yielded, of course, the 
questionnaire. 

In Family 1 the subject draws a picture and in Family 2 he
constructs something from fixed material components. The most 
primitive versions of these devices ask for single pictures or con- 
structions, but the newer vehicles are asking for a series of pictures 
(The Eight Card Redrawing Test) or a series of constructions
(Maps, The Picture World Test, T-343). Progress here seems to 
call for a transition from a single item to a multi-item format. 


An Index to Validity Studies 


While an interest in purely substantive problems can motivate 
important sorts of background research on personality measurement,
the actual labor of instrument development is likely to be
undertaken only when it promises to serve some useful purpose.
Its prospects probably depend in part on the type of validity it
manages to achieve. Educational tests which aim at content validity
will be valued when they sample skills that are much in
demand. Personality devices are usually expected to provide construct
or criterion validity, and in either case there will be a
tendency to judge a measure valuable in proportion as it predicts 
nontest, real world performances that are conceded to be important 
by society. These criteria are usually embedded in a larger social 
matrix and their value to society may be debated in ethical, eco- 
nomic, or political terms. In order to be useful to the tester, how- 
ever, these criteria must contain a substantial amount of personality 
variance and they must also possess purely psychometric virtues 
such as reliability. 

PTR can be read as a history of experience with personality 
scale validation, but in order to use it in this way one must have 
a method of extracting and organizing the relevant data. The test 
classification in Table 4 may not help much because the instru- 
ments in every family tend to be validated against the same 
criteria. Paramount among these criteria are indices of adjustment, 
of job performance, and of successful response to various thera- 
peutic, custodial, or educational treatments. These are classes of 
criteria, not individual variables, and in organizing them it will 
be helpful to pick one class and visualize its members in the 
context of some relevant social process. We will consider the 
various adjustment criteria as they are encountered by an indi- 
vidual with a severe behavior disorder. As the symptoms of such 
a disorder develop, they increasingly disturb either the person who
exhibits them or significant individuals in that person's environment
(family and friends, for example). The person himself may
seek psychiatric help or those closest to him may seek it in his 
behalf. The person’s difficulties are then “diagnosed” and he be- 
comes a "patient" with a nosological label. A course of treatment 
or custodial care is planned and upon its successful completion 
the former patient may resume something like his old position in 
society. 

Patient oriented criteria. Mental health criteria may be developed
at many points in this process. We shall speak of those applicable
before referral as “person oriented” and those applicable after
referral as “patient oriented.” The most popular patient oriented
criteria have been psychiatric diagnoses or ratings, but there are 
signs that these criteria are falling out of favor. Some diagnostic 
categories (psychasthenia) have been scrapped even though meas- 
ures that are supposed to diagnose them continue to thrive. Some 
categories have displayed suspiciously little inter-rater reliability 
and others have had their construct validity as diseases seriously 
challenged (Szasz, 1961). Adjuncts to or substitutes for psychiatric 
diagnosis now seem to be under development. As Shaffer (Buros, 
1970, page 833) points out, inpatient rating schedules (Buros, 
1970, pages 54, 68, 121, 157, 169) offer a new approach to the 
criterion problem. Nurses or other psychiatric staff observe the 
patient’s ward behavior over a period of days or weeks and de- 
scribe it by endorsing standardized, factor analyzed, descriptive 
items. Devices of this sort can probably provide accurate, content
valid descriptions of some aspects of psychotic behavior. Per- 
sonality measures might be validated against them and they 
have the additional advantage of predicting some of the major 
costs of psychiatric care such as response to chemotherapy, closed
versus open ward status, and length of hospitalization. 

The decision to seek psychiatric assistance for one’s self has 
been much used as a criterion, particularly with educated middle- 
class groups. Some questionnaires and sentence completion tests 
claim sizable coefficients of concurrent validity against this criterion,
but this sort of claim has been disparaged by Shaffer
(Buros, 1970, page 1232) who points out that voluntary patients 
tend to exaggerate their symptoms in order to document their 
pleas for help. The prediction of this criterion has nevertheless 
been of real interest to universities, military organizations, and 
other institutions that are committed to providing free health 
services and who are anxious to minimize costs from this quarter. 

Person oriented criteria. It is obviously desirable to develop 
indices of behavioral disturbance that are applicable to the early 
stages of the disorder as these are observed in everyday life. 
Many people will suppose that school teachers are ideally situated 
to provide criterion descriptions of children with behavior problems
because the teachers experience these problems as a cost-producing
factor in their own class rooms. Since 1925 at least
80 rating forms for teachers’ assessments of pupils have been
created. Half of these are now out of print, however, and none
seem to have been extensively used as a criterion for personality
test development or indeed (with two or three honorable exceptions)
for research of any kind. Classroom behavior ratings present
an unusually severe problem for criterion development and
a glance at some of the reviews accorded these instruments in
PTR will show why this is so. 

As Gambrill (Buros, 1970, page 304) points out, some rating 
schedules are designed "as service instruments to help teachers in 
understanding pupils rather than for research purposes.” If ser- 
vice to the teacher has been emphasized in advertising these in- 
struments, it may be for the reasons mentioned by Lundy (Buros, 
1970, page 793). He points out that completing such ratings 
requires so much clerical time and labor that teachers who have 
been requested to use them are likely to comply in only the most 
perfunctory way. This is particularly true when there are no 
facilities to which obstreperous children can be referred or when 
parents refuse to accept evaluations of their children that they 
did not initiate. Some reviewers fear that the use of ratings in the 
schools will result only in the attachment of labels like "mal- 
adjusted” or “emotionally handicapped” to the permanent school 
records of some of the children and that this will do far more harm 
than good, especially in view of the naivete about behavioral dis- 
orders that prevails among school teachers, administrators, and 
the general public. 

Very little seems to have been done on the development of 
adjustment criteria in institutions outside the school. The institution
most affected by a serious psychiatric disorder is likely to
be the family but only one rating form has been developed to 
record information about the patient's disturbed behavior as seen 
by other members of his family. This device is very new and has 
not yet been reviewed (Buros, 1970, page 66). PTR contains 
almost no information about the development of maladjustment
criteria for business organizations. One lone entry concerns a scale
by which subordinates rate college administrators or business
executives. The ratings are supposedly used only for the
person being rated; they are not supposed
to be seen or utilized by the ratee's own supervisors.

The neglect of the criterion problem. While personality testers
would benefit from the existence of adjustment criteria reflecting
the costs of behavioral disorders in everyday life, the develop- 
mental research that is necessary to bring them into being will 
require the consideration of ethical, sociological, economic, and
possibly political problems that transcend the psychometrist's 
traditional concerns for items, norms, and manuals. Buros’ re- 
viewers uniformly underestimate the importance of this work. While 
a validity coefficient must be evaluated against two reliabilities, 
most reviewers assess reliability for the test only and make no 
mention of comparable figures for the criteria. Only rarely do 
reviewers such as Hanawalt (Buros, 1970, pages 482 and 810) 
use criterion reliability data to edit validity studies so as to give 
a clearer picture of the true worth of the instrument. Here and 
there reviewers may mention the contamination of some criterion 
rating, but the subtler problems of rating validity are almost 
never discussed. When ratings are used as criteria, reliability
is not enough. There are studies, for example by Freeberg (1967), 
which show that observer ratings of demonstrable invalidity may 
be quite homogeneous and may show high correlations with other
equally invalid observer ratings. 
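
The remark that a validity coefficient must be evaluated against two reliabilities can be made concrete with the classical correction for attenuation. The sketch below is an illustration only: the function names and the figures are hypothetical and are not taken from any review in PTR. It shows the ceiling that test and criterion reliability place on an observed validity coefficient, and the coefficient that would be expected with perfectly reliable measures.

```python
from math import sqrt

def validity_ceiling(test_rel, criterion_rel):
    """Largest observed validity possible given the two reliabilities."""
    return sqrt(test_rel * criterion_rel)

def corrected_validity(observed_r, test_rel, criterion_rel):
    """Classical correction for attenuation in both test and criterion."""
    return observed_r / validity_ceiling(test_rel, criterion_rel)

# Hypothetical figures: a reliable test and an unreliable rating criterion.
ceiling = validity_ceiling(0.90, 0.40)
corrected = corrected_validity(0.30, 0.90, 0.40)
print(round(ceiling, 2), round(corrected, 2))
```

Such a correction speaks only to reliability. It does nothing about the problem raised above, that ratings may be reliable and still invalid.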


Summary 

Although valuable enough as a guide to professional opinion 
about specific instruments, Buros’ Personality Tests and Reviews 
is even more important as an informant on the history and cur- 
rent status of the whole thrust of the personality assessment 
movement. Its services are available to anyone with a method of
asking questions. We conclude that the nonprojective devices 
(which are chiefly questionnaires) are in a thriving state
but that projective measures are being copyrighted and
researched much less frequently since 1955. New work on these latter
devices could quite conceivably cease within the next ten years. 

Projective test critics have often called for technical improvement,
but Buros’ book suggests that the pace of improvement
is slow in every area of personality assessment. There is little
evidence that improvements in projective measures are likely to
accrue from competition and the survival of the fittest, from the
accumulation of research evidence, or from the stricter terms
of the current professional standards. Instead of being improved,
projective test ideas that once seemed highly creative may now be
in the process of being abandoned.

Since the projective rubric encompasses by far the greater 
variety of copyrighted personality devices, psychologists may
resist abandoning them in favor of the questionnaire. They will 
find PTR full of suggestions for reversing the trend. It is a record 
of controversies that betray our ignorance about the real processes 
that mediate test behavior. It is a compendium of descriptive 
data that may be used to order instruments into developmental 
series within larger format families. It is a source of information 
on the problems of validating personality measures against real- 
world criterion behaviors. An appropriate utilization of these re- 
sources might benefit not only personality assessment but the 
world of personality research, teaching, and service that has 
come to depend on copyrighted measuring devices so heavily.


COMPUTER PROGRAMS FOR TEST OBJECTIVE
AND ITEM BANKING


WILLIAM P. GORTH AND DWIGHT W. ALLEN
The University of Massachusetts 
ARAM GRAYSON 
Stanford University 


Item banking as well as its logical predecessor objective bank- 
ing are becoming increasingly important in educational measure- 
ment. Several projects, both in the U. S. and England, have in- 
vestigated various forms of item banking. The motivation for these 
projects is usually the following: 


1. To make available to educators better test items for use in 
examinations in schools; 

2. To provide test items with known item characteristics so that 
results will be more valid and reliable than those based on 
locally developed items and that results from one setting can 
be compared with those from another setting; 

3. To make teachers more familiar, in general, with modern
notions of test construction including the classification of test
items by categories which they measure, e.g., behavioral
objectives;

4. To utilize test items, which are written by skilled authors,
in many contexts without the added costs of writing new
ones;

and

5. To provide a basis for better decision-making regarding the
placement and instructional treatment of students which will 
minimize losses in students’ time and effort. 


Objective and item banking require several operations including 
stocking of the bank, retrieving information from the bank, and 
using the retrieved information in a variety of testing situations.
Each of these operations could be used to characterize existing 
or future objective and item banks. Stocking the bank consists 
of writing or compiling objectives and items which have been 
classified by content and characterized by item statistics. Retrieving
from the bank consists of finding the objectives and items
which are appropriate to the purpose for which they will be used. 
Using materials from the bank consists of diagnostic testing, 
placement testing, criterion-referenced testing within a course,
pretesting for the different instructional treatments, or testing 
on a longitudinal basis using item sampling. 

Existing efforts in objective and item banking may be characterized
by their purpose and operation. One of the most publicized efforts is
the Instructional Objective Exchange, IOX (Popham, 1970). IOX is an
attempt to make available to teachers instructional objectives, with a
grade level, content, and taxonomical classification of objectives, and
sample test items. These materials are made available in the form of
mimeographed booklets for specific subject areas and grade levels. The
materials are not distributed in a form that can be used directly, i.e.,
the objectives are not printed in a form that could be transferred to a
specific school situation, and the test items are not numerous enough or
appropriately formatted to constitute a test. IOX is not directed toward
immediate implementation of objectives or testing programs but
more as a guide toward development of locally based objectives
and items.

A second major effort is the Computer-Based Test Development
Center, COMBAT (Walter, 1970). The major purpose of COMBAT is to make a
large number of teacher-written questions available to classroom teachers
for their classroom testing. The test items are classified by key word,
and item statistics are not measured. The storage and retrieval of the
items is done by computer. They can be printed immediately in a form
which can be used as a classroom test. The computer printing can be done
on masters from which more copies can be duplicated. The testing
materials are designed for usual classroom testing.

A more sophisticated bank was produced at the National Foundation
for Educational Research in England and Wales because
extensive information about item characteristics was available. 
The work of the Foundation is described by Wood and Skurnik 
(1969). The item bank includes items which can be used by 
school-based examiners to determine the score of students in the 
certification of secondary education in England and Wales in 
mathematics. Extensive work went into the development of the 
item bank. Items were classified by task. They were pretested so 
that their item characteristics were known. The storage and 
retrieval of items was from a card file. 

Another effort in objective and item banking is the focus of this 
paper. The banking system has been developed by the Project 
for Comprehensive Achievement Monitoring, CAM (Gorth, 1968). 
CAM has developed a model of evaluation useful in curriculum 
evaluation and classroom management. The model consists of 
longitudinal testing, using item sampling, of the specific behavioral 
objectives for a course. In order to support the testing activity, 
which uses a large number of test forms, and therefore, a large 
number of test items, computer programs were developed to stream- 
line test development. 

All of the items in the CAM item bank have been classified in at 
least three dimensions: their content, their taxonomical level, and
the sequence in which they are taught in the typical school course. 
The relationships between items and objectives are referenced and 
item analysis information at pretest, posttest, and retention time 
intervals is available. Both the objectives and the test items are
stored on magnetic tape. The classification of items and their 
relation to the objective is also recorded in the computerized item 
bank. The objectives and items may be selected on a preliminary 
basis for perusal by individual teachers or for establishing a
bank consisting of a subset of the total objectives and items. After 
final selection and allocation to test forms, the tests can be printed 
in an easily duplicatable format. The answer keys for the tests are 
both printed and punched into computer cards for quick
analysis by other programs available from CAM (Gorth, Grayson, 
and Lindeman, 1969; Gorth, Grayson, and Stroud, 1969; Gorth, 
Grayson, Popejoy, and Stroud, 1969). 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT

The CAM Computer Programs 
Subsystem 1: Storing and Editing of Objectives and Items


Input. The input into the computer program for storing items 
and objectives consists of three different types of data cards. The 
first type of data card contains the identification number of the 
objective, its classification as to subject and grade level, and its 
text. The text is keypunched onto computer cards in the format in 
which it will be later printed. The second type of data card consists
of test items keypunched onto computer cards with the text
of the test question formatted so that it may be printed out directly.
Each question may have as many as nine alternatives, and the 
correct alternative can be indicated. Each question is also permitted 
to have explanatory notes to the examiner which specify certain 
diagrams, maps, figures, or physical objects which the student 
will be given at the time of the test. The third type of data card
consists of the classification scheme for the test item. Each test 
item is classified by subject and grade level as well as by other 
content and psychological classifications and item statistics. The 
classification information is limited to 15 distinct categories. 
Modifications or additions may be made to the classifiers and the
text of the item after the item bank has been initially developed. 
The capability of modifying the classifiers would allow additional 
item analysis information to be added after the item bank was 
created. The item bank resides on magnetic computer tape and 
additional objectives and items can be added or deleted as the 
occasion requires.

Output. The output from the computer program is a magnetic 
tape and its printed listing. A series of error messages is available 
to inform the user of obvious errors in the processing of data. 
Data which are in error will not be recorded in the bank. Thus,
possible inappropriate information in the item bank is eliminated. 


Subsystem 2: Preliminary Selection of Objectives and Items 


Input. The second computer program selects test objectives and
their associated test items by the classifiers stored with the test
items. The items are chosen by an internal compiler which allows
specific values of any classifier or any combination of classifiers,
e.g., content area, taxonomic level, and item characteristics, to be
used as selection criteria for items and objectives. The desired
criteria are read into the computer program. The program reads the
data bank tape and selects the items by the criteria specified.
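
The selection step can be pictured with a minimal sketch. It is not the CAM program itself, which the article does not reproduce; the `Item` record, the `select_items` function, and the classifier names are assumptions made for illustration. The record keeps the features described for Subsystem 1 (an item text, a list of alternatives, a marked correct alternative, and a small set of classifiers), and selection keeps an item only when every requested classifier matches.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: int
    objective_id: int
    text: str
    alternatives: list          # the article allows up to nine per question
    correct: int                # index of the correct alternative
    classifiers: dict = field(default_factory=dict)  # at most 15 categories

def select_items(bank, **criteria):
    """Return items whose classifiers match every requested value.

    A criterion may be a single value or a collection of accepted values.
    """
    selected = []
    for item in bank:
        for name, wanted in criteria.items():
            accepted = wanted if isinstance(wanted, (set, list, tuple)) else {wanted}
            if item.classifiers.get(name) not in accepted:
                break               # one failed criterion rejects the item
        else:
            selected.append(item)   # no criterion failed
    return selected

# Hypothetical three-item bank.
bank = [
    Item(1, 10, "2 + 3 = ?", ["4", "5", "6"], 1,
         {"subject": "math", "grade": 4, "taxonomy": "knowledge"}),
    Item(2, 10, "7 - 2 = ?", ["5", "4", "9"], 0,
         {"subject": "math", "grade": 4, "taxonomy": "application"}),
    Item(3, 20, "Capital of France?", ["Paris", "Rome"], 0,
         {"subject": "geography", "grade": 5, "taxonomy": "knowledge"}),
]

subset = select_items(bank, subject="math", taxonomy={"knowledge", "application"})
print([item.item_id for item in subset])
```

Returning the subset as a new list mirrors the intermediate tape described in the output of this stage: later stages work from the subset and need not search the master bank again.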

Output. The output from this stage of the item banking program 
consists of two parts. The first part is a printout of the objectives 
on one side of a page of computer output and of the associated 
test items on the other side. If there is more than one item for
an objective, they are all printed in the same part of the output 
one after another. The objective is printed only once next to the 
first item in the list of items associated with it. Each item is 
printed out in the format which would be used on a test. The 
correct answer to an item is indicated when it is printed out. Its 
identification number in the item bank is also printed out for later 
specification.

The second form of output is a magnetic tape which is a summary
of the items selected for this set of tests at the first stage.
These items are recorded on a second data tape so that they will 
not have to be relocated in the master objective and item bank. 
If the bank increases in size to the level of fifty or one hundred 
thousand test items, a preselection or preevaluation of the ap- 
propriateness of items by teachers must be made, but the expense of 
searching the data tape for the preliminary selection of items 
should not be duplicated. Therefore, the subset of items and 
objectives is recorded on the intermediate tape which would be much 
shorter than the original tape and tailored to the specific needs 
of the teacher. The subset may also serve as an objective and item
bank tailored to local needs and can be distributed for local use. 


Subsystem 3: Objective and Test Printing 


Input. The magnetic computer tape written at the preceding stage
in the item banking sequence (containing the items and objectives
selected by individual teachers for specific testing situations) is used
as input for the final stage in the item banking sequence. The final
selection and arrangement of items on test forms are also specified on
data cards.

Output. The output from the third stage of the item banking
procedure consists of tests printed in a form which can be immediately
duplicated and administered to examinees. The tests
may be printed on duplication masters or directly on multi-part
computer output paper. Each page of the test is labeled by the
school, the course, the test form, and the page number.
In addition to the tests, which are printed in a form which can
be used directly, each of the objectives associated with the test is
printed out in a form which also can be directly duplicated and
distributed to examinees or students in a course. The objectives
are numbered according to the pattern and sequence which has
been chosen by the teacher to fit his curriculum organization.
An answer key for each test is also provided. The answer key is
printed and punched in a format which is appropriate for the
analysis programs developed by Project CAM (Gorth, 1968).
Each answer key contains all the information concerning the
classification of each item on the test. The alternatives to the
items are randomized by the computer program before they are
printed out in the final version of the test.
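
Randomizing the alternatives means that the answer key must be rebuilt from the shuffled order. A minimal sketch of that bookkeeping follows; the function names, the tuple layout, and the fixed seed are assumptions for illustration and do not describe the CAM program's internals.

```python
import random

def randomize_alternatives(alternatives, correct_index, rng):
    """Shuffle the alternatives and return them with the new key position."""
    order = list(range(len(alternatives)))
    rng.shuffle(order)
    shuffled = [alternatives[i] for i in order]
    # The correct alternative now sits where its old index landed.
    return shuffled, order.index(correct_index)

def build_form(items, seed=0):
    """items: list of (item_id, text, alternatives, correct_index) tuples."""
    rng = random.Random(seed)   # fixed seed so a form can be regenerated
    form, key = [], []
    for item_id, text, alternatives, correct_index in items:
        shuffled, new_correct = randomize_alternatives(alternatives, correct_index, rng)
        form.append((text, shuffled))
        key.append((item_id, new_correct + 1))  # 1-based position for the key
    return form, key

# Hypothetical two-item form.
form, key = build_form([
    (101, "2 + 3 = ?", ["4", "5", "6"], 1),
    (102, "7 - 2 = ?", ["5", "4", "9", "3"], 0),
])
for (text, alternatives), (item_id, position) in zip(form, key):
    print(item_id, text, alternatives, "key:", position)
```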

Additional information about the programs and their availability
may be obtained from William Gorth, Director, Project CAM,
School of Education, The University of Massachusetts, Amherst,
Massachusetts 01002.


INTERACTIONS AMONG GROUP REGRESSIONS:
TESTING HOMOGENEITY OF GROUP REGRESSIONS 
AND PLOTTING REGIONS OF SIGNIFICANCE 


GARY D. BORICH 


Institute for Child Study 
Indiana University 


RECENT interest in aptitude-treatment research has revitalized
procedures for determining interactions among group regressions. 
Among these procedures are the homogeneity of group regressions
test and the Johnson-Neyman technique. The homogeneity of 
group regressions model (Walker and Lev, 1953; Edwards, 1968) 
tests the hypothesis that regression slopes are equal across treat- 
ments, while the Johnson-Neyman technique (1936) determines 
regions of significance and nonsignificance when the equal slopes 
hypothesis is rejected. 

Two programs have been reported which analyze aptitude-treatment
interactions; both, however, are limited in scope. Terranova
(1970) has programmed an F-test for the homogeneity of
group regressions, but because group interactions were not of pri- 
mary interest, the program does not determine regions of significance. 
A second programming effort by Carroll and Wilson (1970) has 
produced a program which determines regions of significance with- 
out first testing for homogeneity of group regressions. Their pro- 
gram determines regions of significance for the case in which there 
are two groups and two predictor variables but does not plot the 
data along the regression lines to indicate where such regions are 
meaningful. Although one option to the Carroll and Wilson pro- 
gram is that data input may be in the form of means, standard 
deviations, and correlations, it is important to note that the 
Johnson-Neyman technique assumes linearity of regression as well
as significant correlations between predictor and criterion. Therefore,
correlations and scatterplots should be computed before the
Carroll and Wilson program is used. 


Program Description

The program combines the essentials of the Terranova and the 
Carroll and Wilson programs for the case in which there are two 
groups, one predictor variable, and one criterion variable. The 
program plots data points, regression lines, and region(s) of significance.
As suggested by Abelson (1953), the homogeneity of
regression slopes test is performed, after which regions of significance
are determined, if applicable.

Input to the program consists of (a) a selected probability 
level for the Johnson-Neyman test, (b) slope and constant for the 
treatment groups, (c) format for each group, (d) N's for each group,
and (e) data cards, with criterion first and predictor second. 

The program computes and prints: 


1. F-test for homogeneity of regressions, 
and if applicable: 

2. Point at which regression lines intersect. 

3. Lower boundary of significance. 

4. Upper boundary of significance. 

The program plots: 

1. Scattergram for two treatments with data points for each
treatment coded differently.

2. Regression lines among data points. 

3. Region(s) of significance with boundary points plotted 
through data. 


The abscissa is automatically scaled to include all aptitude
values, and the ordinate is automatically scaled with predicted and
obtained criterion values. Sample problem solutions and plots are
available from the author.
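
The quantities printed by the program can be sketched for the two-group, one-predictor case. The code below is not the author's program; the function names are hypothetical, the formulas are the standard textbook ones for this case, and the critical F value is assumed to be supplied by the user from a table with 1 and N - 4 degrees of freedom.

```python
from math import sqrt

def fit_group(x, y):
    """Least-squares summary for one treatment group."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ssx = sum((xi - mx) ** 2 for xi in x)
    ssy = sum((yi - my) ** 2 for yi in y)
    spxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = spxy / ssx
    return {"n": n, "mx": mx, "ssx": ssx, "ssy": ssy, "spxy": spxy,
            "slope": slope, "intercept": my - slope * mx,
            "ss_res": ssy - spxy ** 2 / ssx}

def homogeneity_f(g1, g2):
    """F (df = 1, N - 4) for the hypothesis of equal slopes."""
    ss_separate = g1["ss_res"] + g2["ss_res"]
    ss_common = (g1["ssy"] + g2["ssy"]
                 - (g1["spxy"] + g2["spxy"]) ** 2 / (g1["ssx"] + g2["ssx"]))
    df2 = g1["n"] + g2["n"] - 4
    return (ss_common - ss_separate) / (ss_separate / df2), df2

def intersection(g1, g2):
    """Predictor value at which the two regression lines cross."""
    return (g2["intercept"] - g1["intercept"]) / (g1["slope"] - g2["slope"])

def jn_boundaries(g1, g2, f_crit):
    """Roots of D(X)^2 = f_crit * Var(D(X)), D being the difference between
    the two predicted criterion values at X. Returns None if no real roots.
    When the leading coefficient is positive, the groups differ
    significantly for X below the lower root or above the upper root."""
    df2 = g1["n"] + g2["n"] - 4
    s2 = (g1["ss_res"] + g2["ss_res"]) / df2
    da = g1["intercept"] - g2["intercept"]
    db = g1["slope"] - g2["slope"]
    a = db ** 2 - f_crit * s2 * (1 / g1["ssx"] + 1 / g2["ssx"])
    b = da * db + f_crit * s2 * (g1["mx"] / g1["ssx"] + g2["mx"] / g2["ssx"])
    c = da ** 2 - f_crit * s2 * (1 / g1["n"] + 1 / g2["n"]
                                 + g1["mx"] ** 2 / g1["ssx"]
                                 + g2["mx"] ** 2 / g2["ssx"])
    disc = b * b - a * c
    if a == 0 or disc < 0:
        return None
    root = sqrt(disc)
    return tuple(sorted(((-b - root) / a, (-b + root) / a)))

# Hypothetical data: aptitude x, criterion y, two treatments.
g1 = fit_group([1, 2, 3, 4], [1, 3, 2, 4])
g2 = fit_group([1, 2, 3, 4], [4, 2, 3, 1])
print(homogeneity_f(g1, g2), intersection(g1, g2))
```

Passing the critical F value in, rather than computing it, keeps the sketch free of any statistical library; a caller with such a library could compute it instead.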


Summary 
The Johnson-Neyman technique assumes that regression lines for
each treatment are linear and slopes unequal. A scatterplot for the
linearity of regression lines assumption and an F-test for the
more restrictive assumption of homogeneity of regression slopes
are provided by the program with a plot of the region(s) of significance.
The program provides all necessary calculations for the
aptitude-treatment investigation in which there are two groups,
one aptitude, and one criterion.


A FORTRAN PROGRAM FOR THE ANALYSIS OF
LINEAR COMPOSITE VARIANCE 


JOHN A. CREAGER 
American Council on Education


WHEN regression, discriminant, and canonical models are ap- 
plied to empirical data, the result is one or more linear composites 
of variables. Interest in such composites seldom lies solely in
maximizing prediction, but also in some insight into the inter- 
relations among the variables in their roles of contributing to that 
prediction. Typically, the resulting linear composites are examined 
for what is involved algebraically in the prediction system, and for 
what is involved empirically in the research context in which the 
data were obtained. Both informal procedures, e.g., “eyeballing
the weights,” and more sophisticated techniques, involving the 
terms of the standard formula for linear composite variance, partial 
or part correlations, or analysis of covariance, are unsatisfactory 
because they fail to cope adequately with the multicollinearity 
of the system. In order to cope with multicollinearity in the 
analysis of a prediction system, it is necessary “to grasp the 
nettle” and analyze the role of the multicollinearity itself. Otherwise,
we shall continue to be confounded in our judgments of the
relative import of the variables defining prediction composites.


A procedure for accomplishing such analysis of linear composites
has been proposed, and has been illustrated for the special
case of regression composites, by Creager and Boruch (1969).
Subsequent development has shown the procedure to be completely
general and therefore applicable to the analysis of left and right
composites associated with one or more canonical roots. The basis
of the procedure is the determination of loadings for the linear
composites on factors defined by complete orthogonal factor analysis
of the correlation matrix used to develop the composites.
Squared loadings for the composites are completely independent and
additive portions of the total linear composite variance. To be of
practical value, the orthogonal factors must have substantive
meaning.

Program COMPVAR implements the Creager-Boruch procedure 
for orthogonal analysis of linear composite variance, given the 
results of a complete orthogonal factor analysis of a prediction
system. Variable formatting permits one to take the punched out- 
put from a factor program, e.g., Restricted Maximum Likelihood 
Factor Analysis (Joreskog and Gruvaeus, 1967) or VARIMAX 
(Kaiser, 1959) and obtain loadings on one to six linear com- 
posites, based on the set of variables factored. This flexibility per- 
mits analysis of several regression composites using the same 
predictors but varying criteria, up to six discriminant functions, 
or the left and right composites from three canonicals. Com- 
posites are assumed to be defined in terms of standard scores in 
the components. 

The program is written in FORTRAN IV, and includes two 
subroutines: MATMLY (matrix multiplication) and RESYMA, 
which reads in and prints the correlation matrix. I/O is entirely 
card input and on-line, printed output. 


Input 


Input will be described in terms of the data deck structure. 
1. Problem Card in 20A4 permits user to label his run.
2. Parameter Card in 2I2,I1:

NVAR, the number of variables, up to 50.
NCOM, the number of common factors, up to 50.
NLIN, the number of composites, up to 6.

3. Two variable format cards in 20A4, the first designating format
for common factor loadings, the second designating format for
vectors of squared uniqueness loadings and of composite weights.
4. A third variable format card in 18A4 designating format for
reading in the lower triangular correlation matrix in subroutine
RESYMA.

5. The correlation matrix. Each row is punched from the left up
to, but excluding the diagonal, the rest of the card being left
blank. Where the number of variables and format require, a slash
may be used in the variable format and continuation cards used
for each row of the matrix.

6. Cards containing factor loadings. There will be one set of cards 
for each row (i.e., variable) of the factor matrix. Each set of
cards will contain the common factor loadings for that variable. 

7. Sufficient cards in format to read in a 1 x NVAR vector of 
squared uniqueness loadings. The program converts this vector 
to diagonal matrix form. 

8. Cards containing the composite weights. There will be one card 
(or set of cards) for each composite, reading in a vector (1 X 
NVAR) of weights. 


Output 


The print-out exhibits the following: 
1. Contents of the problem card, parameter card, and three variable 
format cards read in. 
2. The input correlation matrix in full symmetric form with unit 
diagonals. 
3. The common factor matrix in usual form. 
4. A column vector of squared uniqueness loadings. 
5. The transpose (NVAR X NLIN) of the weights matrix. Thus, 
the weights for a given composite are found as a column vector 
in this matrix. 
6. The variance-covariance matrix of raw score composites of 
standard scores, WRW’. 
7. A column vector of composite standard deviations, the square 
root of diagonal values in the previous matrix. These values are
explicitly required for computations in the Creager-Boruch pro- 
cedure. 
8. The NLIN x NCOM matrix of estimated loadings for composites
on the common factors. For a given composite, the loadings
are found as a row vector in this matrix. 
9. The NVAR x NLIN matrix of estimated loadings for composites
on the unique factors. These values are not squared. For a
given composite, the loadings are found as a column vector in
this matrix.
10. The variance-covariance matrix of standard score composites
(i.e., composite intercorrelation matrix) developed from multiplying
factor loadings (both common and unique) for composites by
its transpose. Diagonal values may fall short of unity by an
amount dependent upon the completeness of the factoring of the 
original correlation matrix. The square root of these values is com- 
puted and printed as a column vector of standard deviations of 
these composites. 

11. The NLIN x NCOM matrix of squared loadings for composites
on common factors.

12. The NVAR x NLIN matrix of squared loadings for composites
on unique factors.
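
The arithmetic behind output items 6 through 12 can be sketched for a single composite. This is one reading of the output list, not the COMPVAR source: the function name and the two-variable example are hypothetical, and the loadings are taken to be the weighted sums of the variables' factor loadings divided by the composite standard deviation.

```python
from math import sqrt

def composite_analysis(R, F, u2, w):
    """R: correlation matrix (NVAR x NVAR); F: common factor loadings
    (NVAR x NCOM); u2: squared uniqueness loadings; w: weights defining
    one composite in standard-score form."""
    n = len(w)
    # Composite variance, the scalar case of WRW'.
    variance = sum(w[i] * R[i][j] * w[j] for i in range(n) for j in range(n))
    sd = sqrt(variance)
    ncom = len(F[0])
    # Loadings of the composite on each common factor.
    common = [sum(w[i] * F[i][k] for i in range(n)) / sd for k in range(ncom)]
    # Loadings of the composite on each variable's unique factor.
    unique = [w[i] * sqrt(u2[i]) / sd for i in range(n)]
    return variance, common, unique

# Hypothetical example in which R is exactly reproduced by the factoring.
R = [[1.0, 0.48], [0.48, 1.0]]
F = [[0.8], [0.6]]
u2 = [0.36, 0.64]
variance, common, unique = composite_analysis(R, F, u2, [1.0, 1.0])
portions = [l * l for l in common] + [l * l for l in unique]
print(variance, portions, sum(portions))
```

When the factoring reproduces the correlation matrix exactly, as in this example, the squared loadings sum to unity. With incomplete factoring the sum falls short, which corresponds to the shortfall in the diagonal values noted under output item 10.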


Comments 


Program COMPVAR has been debugged on the XDS Sigma 
Five computer, using the hypothetical example problem in the 
original Creager-Boruch paper, and with real data from an 
achievement study (Jones, 1963).¹ The latter involved four predictors
and two criteria, and developed two canonicals (four
composites) and multiples for each criterion. All six composites 
were analyzed in a single pass of the program requiring 1.17 minutes
including compilation. Actual execution time was 0.2 minute.
Liberal use of comment cards and judicious selection of program 
parameter names make the program easy to read and follow. 

The exhaustive factoring of the correlation matrix is critical for 
complete account of system variance. One should, in fact, define 
composites using the same correlation matrix as the one actually 
factored, i.e., actually reproduced by the factor solution. If one is
using an estimation procedure like maximum likelihood for factoring,
he will get the population estimate of the correlation matrix
and can use this in the production and analysis of composites (e.g.,
regression, discriminant, or canonical). If one is working entirely
from the sample data, an algebraic factor solution such as
VARIMAX will give a reasonable solution, provided sufficient
principal components have been rotated to obtain a clean separation
of common and unique factors. Serious consideration should also be 
given to prior correction of the correlation matrix for attenuation. 


Availability 


Copies of the source program deck, sample print-out, and program
documentation may be obtained for the nominal cost of reproduction
and mailing. Requests should be addressed to the author, Office of
Research, American Council on Education, One Dupont Circle,
Washington, D.C. 20036.

¹ Dr. Jones kindly supplied computer printout of his canonical analysis.


FACTOR SIMILARITY


CARMELO TERRANOVA 
Educational Research Council of America, Cleveland 


AN intuitively reasonable measure of attitude change, when using 
a semantic differential instrument, is the degree of incongruity between
the factors extracted from similar concepts of pre- and post-administrations.
The wider use of the semantic differential as a
measurement instrument for attitude change has been somewhat restricted
by the unwieldy amount of data generated and by the relative
inaccessibility of an incongruity measure. Harmon (1960) and
Tucker, Koopman, and Linn (1969) described a coefficient of con- 
gruence that measures the degree of similarity between a factor of 
one matrix and a factor of another matrix. 


Description of Program 

The program was designed to compare factors from similar con- 
cept matrices (person by scale) obtained at different times. It also 
may be used to compare factors of different concepts. It allows the 
utilization of all variables (scales) in determining congruence between
factors rather than by comparing factors by means of a subset
of variables (i.e., activity, potency, or evaluation). 

Input consists of (a) the number of concept pairs being com- 
pared, (b) the identification numbers of the concepts, (c) the num- 
ber of factors in each concept, (d) the number of variables com- 
prising each factor, and (e) the factor loadings of the previously 
rotated solution. The following is computed and printed for each pair 
of factors: (a) the sum over the variables of the products of the 
factor loadings, and (b) the coefficient of congruence. 
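
The two quantities printed for each pair of factors can be sketched as follows. The code is a minimal illustration rather than the program itself; the function name and the loadings are hypothetical, and the coefficient is computed in its usual form, the sum of cross-products divided by the square root of the product of the two sums of squared loadings.

```python
from math import sqrt

def congruence(a, b):
    """Return (sum of cross-products, coefficient of congruence) for two
    factors given as loadings on the same variables in the same order."""
    cross = sum(x * y for x, y in zip(a, b))
    coefficient = cross / sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return cross, coefficient

# Hypothetical loadings of one factor on five scales, pre and post.
pre = [0.80, 0.75, 0.10, 0.05, 0.60]
post = [0.70, 0.80, 0.20, 0.00, 0.55]
cross, coefficient = congruence(pre, post)
print(round(cross, 3), round(coefficient, 3))
```

Because the coefficient uses every variable, it reflects the design choice described above: congruence is judged on all scales, not on a subset.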

Discussion 

The usefulness of the semantic differential as a measurement instrument
for attitude change may be increased by this procedure,
that is, by computing indices of factor similarity and subsequently
interpreting the degree of dissimilarity as evidence of attitude
change; conversely, the degree of similarity or congruence may be
indicative of the stability of attitudes. The computation of the
coefficient of congruence provided by this program ought to aid
investigators who would like to use this intuitively pleasing measure.


EIGENVALUES AND VECTORS OF LARGE
MATRICES ON THE IBM-1130 


JOHN R. HOWELL AND SHARON L. CREWS
Virginia Commonwealth University 


This FORTRAN language program calculates the eigenvalues and
eigenvectors of (or “diagonalizes”) a real symmetric N x N matrix 
by Jacobi's method as described by Ralston (1965). A slow version 
(3.6 microsecond) IBM-1130 computer with disk and 16K words of
core storage was used. Eigenvalues and vectors, of course, are required
in a number of multivariate statistical methods including
principal components and factor analysis. 

A straightforward application of the Jacobi method with the 
eigenvalues and vectors being developed in two square arrays in core 
revealed that a matrix up to size N = 50 could be handled. The 
computer time required for N(N — 1)/2 plane rotations for both 
arrays (for N = 50) was 20 minutes. 


Diagonalization of Larger Matrices Using Disk 
It was felt that one should be able to use the large storage capacity
of disk in order to diagonalize matrices larger than size 50.
Assuming that a real, symmetric, N x N matrix to be diagonalized
is stored in File 1 on the disk and an N x N identity matrix is
stored there on File 2, one proceeds as follows:


1. First, read the matrix from File 1 into core and perform
   N(N - 1)/2 plane rotations. Store the N(N - 1)/2 sine and cosine
   pairs used for these rotations on File 3 on disk.

2. Now store this partially diagonalized matrix back on File 1
   and move the identity matrix (later the partially developed
   eigenvectors) from File 2 on disk to the same array in core
   that was used in step 1.

3. Using the sine and cosine pairs that were just stored on File 3
   on disk, perform the same N(N - 1)/2 plane rotations on the
   identity matrix that were performed on the matrix being
   diagonalized.

4. Now move these partially developed eigenvectors to File 2 on
   disk and repeat the entire process (which may be called one
   cycle) a specified number of times, usually about six. If
   further iterations are required for greater accuracy, the program
   can be executed again and the process continued a specified
   number of times. This can be done because the partially
   developed eigenvalues and vectors are stored permanently in
   disk files.
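
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' FORTRAN program: the names `sweep`, `replay`, and `diagonalize` are hypothetical, the three disk files are replaced by in-memory lists, and the rotation angle is chosen by the standard Jacobi formula, which the text attributes to Ralston (1965) without reproducing it.

```python
import math


def sweep(a):
    """Steps 1-2: apply N(N-1)/2 plane rotations to symmetric matrix a in place.

    Returns (p, q, cosine, sine) for every rotation; this list stands in for File 3.
    """
    n = len(a)
    rotations = []
    for p in range(n - 1):
        for q in range(p + 1, n):
            if a[p][q] == 0.0:
                c, s = 1.0, 0.0  # element already zero, identity rotation
            else:
                # Standard Jacobi angle: pick t = tan(theta) so that a[p][q] becomes 0.
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
            for k in range(n):  # columns p and q: A <- A J
                akp, akq = a[k][p], a[k][q]
                a[k][p] = c * akp - s * akq
                a[k][q] = s * akp + c * akq
            for k in range(n):  # rows p and q: A <- J^T A
                apk, aqk = a[p][k], a[q][k]
                a[p][k] = c * apk - s * aqk
                a[q][k] = s * apk + c * aqk
            rotations.append((p, q, c, s))
    return rotations


def replay(v, rotations):
    """Step 3: apply the stored rotations to the eigenvector array (V <- V J)."""
    for p, q, c, s in rotations:
        for row in v:
            vp, vq = row[p], row[q]
            row[p] = c * vp - s * vq
            row[q] = s * vp + c * vq


def diagonalize(matrix, cycles=6):
    """Step 4: repeat the cycle; return eigenvalues, eigenvectors, mean |sine| per cycle."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    mean_sines = []
    for _ in range(cycles):
        rotations = sweep(a)
        replay(v, rotations)
        mean_sines.append(sum(abs(s) for _, _, _, s in rotations) / max(len(rotations), 1))
    return [a[i][i] for i in range(n)], v, mean_sines
```

Column j of the returned eigenvector array pairs with eigenvalue j. Keeping only one square array active at a time, as `sweep` and `replay` do, is what lets the procedure trade core storage for disk traffic.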


Tactics 

The table below shows the IBM-1130 computer times required 
for one cycle for matrices from size 50 to size 80, the largest that can 
be handled by this approach on an IBM-1130 computer (with 16K 
words of core storage). Any given matrix between these sizes can be 
stored on disk during any working day. As soon as a large block of 
computing time is available, such as nights, weekends, or holidays, 
the program can be executed, using the above procedure. The com- 
puter can be left unattended during this time, since there is minimal 
print-out (and thus, little danger of a paper jam) until the end of
computations. The average absolute value of the sine for the
N(N - 1)/2 plane rotations is printed. This value should approach
zero as diagonalization is neared. Finally, the diagonalized matrix
of eigenvalues and the matrix of eigenvectors are printed.


Use of Results 
When one is satisfied that the required eigenvalues and vectors
have been obtained with sufficient accuracy, any existing main
program that calls a matrix diagonalizing subroutine can be modified
to read the required numbers from disk.


TABLE 1
Cycle Times for Matrix Sizes

Matrix Size    Cycle Time (Minutes)
    50         (illegible)
    60         (illegible)
    70         101
    80         151

Note. These points plot as a straight line on semi-logarithmic paper.


Availability 
Program listings can be obtained by writing to John R. Howell,
Department of Biometry, Virginia Commonwealth University, Box
832, MCV Station, Richmond, Virginia 23219.


A COMPUTER PROGRAM FOR ESTIMATING
RELATIVE SEQUENTIAL CONSTRAINT


WILLIAM B. RUDOLPH 
Iowa State University 
ROBERT B. KANE 

Purdue University 


THE inception of information theory (Shannon, 1948; Wiener,
1948) gave researchers an additional tool to study language.
However, the ubiquitous enigma in the application of information theory
concepts to language analysis is the determination of entropy and
the subsequent redundancy. Garner and Carson (1960) separated
redundancy into two parts: distributional constraint and sequential
constraint. Binder and Wolin (1964) proved sequential constraint
was equal to the multiple contingent uncertainty. A model
formulated by Newman and Gerstman (1952) may be adapted to estimate
the multiple contingent uncertainty and, consequently, the relative
sequential constraint. Briefly, the constraint imposed on the
criterion variable (symbol being predicted) by each of the predictor
variables (preceding m symbols where m is a positive integer) is
determined. These constraints are then summed, resulting in an
estimate of sequential constraint.

Carterette and Jones (1963) computed relative sequential
constraints for a variety of children's books as well as for biblical
passages and adult literature, all written in English. Their program
accommodated 28 distinct characters (26 letters, end of word, end
of sentence).

The study of relative sequential constraints of technical English
requires a program which can accommodate many more than 28
distinct characters because of the heavy use of nonalphabetic
symbols in technical discourse.

A program to compute relative sequential constraint for passages
containing up to 126 distinct characters was created. Textual
material is first encoded into machine characters. Then contingency
tables are constructed which show the frequency with which each
symbol follows every other symbol immediately and at distances
of 2, 3, 4, ..., 120 characters. Finally, computations resulting in
estimates of the relative sequential constraint are executed.
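
The procedure just described (one contingency table per distance, then a sum of the resulting constraints) might be sketched as follows. This is a rough illustration, not the authors' program: the function names are hypothetical, the contingent uncertainty is taken in its usual form U(y:x) = U(x) + U(y) - U(x,y), and dividing by log2 of the number of distinct symbols to obtain a "relative" value is an assumption, since the text does not state the normalization used.

```python
import math
from collections import Counter


def uncertainty(counts):
    """Shannon uncertainty (bits) of a frequency table."""
    total = sum(counts.values())
    return sum(-c / total * math.log2(c / total) for c in counts.values())


def contingent_uncertainty(symbols, distance):
    """Uncertainty shared by a symbol and the one `distance` places before it."""
    predictors = symbols[:-distance]
    criteria = symbols[distance:]
    table = Counter(zip(predictors, criteria))  # the contingency table
    return (uncertainty(Counter(predictors)) + uncertainty(Counter(criteria))
            - uncertainty(table))


def sequential_constraint(symbols, max_distance):
    """Sum of the constraints contributed by distances 1..max_distance."""
    return sum(contingent_uncertainty(symbols, d)
               for d in range(1, max_distance + 1))


def relative_sequential_constraint(symbols, max_distance):
    """Assumed normalization: divide by the maximum uncertainty per symbol."""
    alphabet = len(set(symbols))
    if alphabet < 2:
        return 0.0
    return sequential_constraint(symbols, max_distance) / math.log2(alphabet)
```

A strictly alternating passage such as "abab..." gives a distance-1 value close to one bit, because each symbol is fully determined by its predecessor, while a passage made of a single repeated symbol gives zero.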

The user may direct the machine to cut off contingency table
construction at any stage between 2 and 120 that he desires. Used
on a CDC 6500 computer the program is quite efficient. To
illustrate, a 5,000 symbol passage containing 51 distinct characters was
analyzed in approximately 16 seconds whereas a 113,097 symbol
passage containing 78 distinct characters was analyzed in
approximately 367 seconds. These analyses computed sequential constraint
for characters separated by a maximum distance of 16.


AN ALTERATION OF PROGRAM U TEST TO
DETERMINE THE DIRECTION OF GROUP
DIFFERENCES FOR THE MANN-WHITNEY
U TEST


STEVEN M. JUNG AND DEWEY LIPE
American Institutes for Research, Palo Alto


THOMAS J. QUIRK
Educational Testing Service, Princeton


ONE of the most useful, and most used, of nonparametric
procedures is the Mann-Whitney U test (Siegel, 1956; Hays, 1963). The
U test is used as an alternative to the parametric t test in assigning
a probability statement to the differences between two independent
samples. Probability in this case is a function of the “smallness” of
U, and significance is assigned on the basis of U values which are
equal to or less than tabled values found in Siegel. Alternatively, in
cases where sample group sizes are large, a z transformation may
be made, allowing direct probability estimation from the unit
normal distribution.

The calculations necessary to compute U and associated z values
are simplified considerably by the use of three FORTRAN
subroutines described in the IBM Scientific Subroutine Package (IBM,
Undated). These subroutines (RANK, TIE, and UTEST), when
called by a suitable user-constructed main program, perform the
operations of ranking observations in the two sample groups,
calculating the sums of ranks corrected for ties, and computing the
values of U and z (for suitably large samples). The user can
employ his own ingenuity in the construction of his main program, but
all that are required are the basic I/O and subroutine call
statements.

A deceptive and subtle problem in this procedure is that of
determining the direction of obtained differences. One common practice
is to incorporate into the body of the main Mann-Whitney U
program some routine for calculating descriptive statistics on the
dependent variables for the two sample groups. Or, alternatively, the
user may run his data through some available descriptive statistics
program prior to the application of the U test program. In either
case, the output usually contains such statistics as the mean or
median of the dependent measures for each sample group. It is
fallacious, however, to assume that these statistics are adequate to
indicate the direction of obtained sample differences, especially in
situations which have called for the use of a nonparametric test.

In Table 1 a set of contrived data illustrates a situation in which
a measure of central tendency shows Group 2 to be the “larger” of
two groups, whereas a measure of the relative rank ordering of the
scores shows Group 1 to be greater.


TABLE 1
Mean and Rank Order Values of Two Sets of Contrived Data

     Group 1 (N = 5)          Group 2 (N = 6)
    Scores     Ranks         Scores     Ranks
       0        1              49        4.5
      50        9.5            49        4.5
      50        9.5            49        4.5
      50        9.5            49        4.5
      50        9.5            49        4.5
                               49        4.5
  ΣX1 = 200  ΣR1 = 39      ΣX2 = 294  ΣR2 = 27
  Mean X1 = 40             Mean X2 = 49


This situation is obviously more likely to occur when one of the
two distributions being compared is highly skewed. Yet it is just such
a situation which calls for the application of the non-parametric
Mann-Whitney U test in preference to a parametric test such as t,
which makes use of mean differences. Siegel’s (1956) exposition,
while lucid in other details, sheds little light directly on the problem
of determining the direction of obtained differences. Intuitively and
computationally, however, the solution is straightforward.


When applying the sum of ranks method for calculating U, two
formulas may be used:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 \qquad (1)$$

or, equivalently:

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 \qquad (2)$$

where
$n_1$ = number of cases in sample group 1.
$n_2$ = number of cases in sample group 2.
$R_1$ = sum of ranks assigned to group 1.
$R_2$ = sum of ranks assigned to group 2.

and where the lowest score is given a rank of 1 and the highest is
given a rank of $n_1 + n_2$.

Formulas (1) and (2) generally yield different values; however,
only the smaller of these values is called U and represents the basis
for the tables found in Siegel (1956). The larger value is called U’.
If only the first formula is used, the transformation

$$U_2 = n_1 n_2 - U_1 \qquad (3)$$

may be used to determine the value of U. As it happens, the latter
is the procedure which is followed in the Scientific Subroutine
Package (SSP) subroutine UTEST. UTEST always returns to the
main program the value U which is the smaller of two possible
values, $U_1$ and $U_2$. This is correct procedure. However, the
investigator is forced to use other data to determine which group produced
this U value and was, hence, stochastically the larger.¹
An easy modification may be made to the UTEST subroutine
in order to return to the main program values which can be used to
determine the direction of group differences. This is to alter the
subroutine so that both $U_1$ and $U_2$ are returned, enabling the investigator
to determine by observation which value is smaller and, hence, which
group has the larger sum of ranks.


¹Siegel confuses this issue in his example (Siegel, 1956, pp. 121-123) by
using the value U’ to compute a z, which led him to reject the null hypothesis
in favor of his group 2. The actual value of U in this case, as computed
in formula 2, is 64.0, which yields a similar, although negative, z of ....
Since either U or U’ will produce identical absolute values of z, it is important
for the experimenter to know which U value is used and from which formula,
1 or 2, it is derived in order to know which group is larger. There is much
less chance of confusion if U is always used rather than U’, since it
always represents the stochastically larger group.




In the example presented in Table 1 the $U_1$ value (associated with
group 1) is six and the $U_2$ value (associated with group 2) is 24. The
smaller value, $U_1$, becomes U. When compared against the tabled
values of U in Siegel (1956, p. 271), this allows the assignment of a
.063 probability that so small a value could have occurred by
chance. Directionality is determined easily, since the group which
produced U is the larger. In the example case, this is group 1.
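
The Table 1 example can be checked with a short Python sketch of formulas (1) and (2). It is an illustration only and does not reproduce the SSP subroutines RANK, TIE, or UTEST; the names `midranks`, `u_both`, `larger_group`, and `z_from_u` are hypothetical, and the z function uses the usual large-sample normal approximation without a correction for ties, which is an assumption not taken from this text.

```python
import math


def midranks(values):
    """Rank from 1 (lowest) upward, giving tied scores their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def u_both(group1, group2):
    """Return (U1, U2) so that the direction of the difference is visible."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1  # formula (1)
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2  # formula (2)
    return u1, u2


def larger_group(u1, u2):
    """The group that produced the smaller U has the larger sum of ranks."""
    if u1 == u2:
        return None
    return 1 if u1 < u2 else 2


def z_from_u(u, n1, n2):
    """Assumed normal approximation, no tie correction."""
    return (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)


u1, u2 = u_both([0, 50, 50, 50, 50], [49] * 6)
print(u1, u2, larger_group(u1, u2))  # 6.0 24.0 1
```

Here the means favor group 2 (49 against 40), yet the smaller U belongs to group 1, which is the point of the contrived data.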

It may rarely occur, but it certainly is not improbable that, for
some studies, U’ will equal U. When U’ equals U, this means that
all the scores in both samples are equal and, therefore, all receive the
same rank order.

The characteristics of a program which makes use of the UTEST
subroutine modified in this manner are now described.


Input 
1. System cards.
2. Header card, to be printed on output.
3. Control card, containing: (a) number of subjects in smaller
   group; (b) number of subjects in larger group; (c) number of
   variables; (d) indicator for omission of correction for ties;
   and (e) indicator for descriptive statistics desired.
4. F-type variable format card.
5. Data cards, with data for smaller group placed first.
6. Finish cards after last data set.


Output 
1. Heading identification information as punched on header card
   above.

2. Descriptive statistics for all variables in smaller group and
   larger group, in that order.

3. Value of $U_1$ and $U_2$ for each variable and value of z computed
   from the smaller of these values (U) if the number of cases in
   the larger group is more than 90.


Limitations

The sum of $n_1 + n_2$ cannot exceed 200, and the number of variables
cannot exceed 50.


Copies of this program are available upon request from the senior
author.


A COMPUTER PROGRAM FOR NONPARAMETRIC
POST HOC COMPARISONS FOR TREND*


JAMES J. ROBERGE 
Temple University 


IN behavioral science experiments involving K treatments the
researcher is usually not satisfied with merely refuting the equality
of the treatment effects. Instead, he is often more interested in
performing particular post hoc comparisons, e.g., trend comparisons,
among the treatments. For parametric analysis of variance models,
a common procedure, following the rejection of the null hypothesis
of equal expected values, is the use of orthogonal polynomials to
estimate the magnitude of these trend comparisons. Recently,
Marascuilo and McSweeney (1967) discussed analogous post hoc
trend analysis procedures which may be employed following the
rejection of the null hypothesis by a nonparametric test such as
the Kruskal-Wallis (1952) one-way analysis of variance for rank
data, the Friedman (1937) two-way analysis of variance for rank
data, or the Cochran (1950) two-way analysis of variance for
dichotomous data.

The program described in this paper is designed to perform the
aforementioned nonparametric analyses. Specifically, it (a)
calculates the statistic for a given nonparametric test, (b) compares
this statistic with the chi-square value required for significance at
the 5 per cent level, and (c) calculates post hoc confidence intervals
for the trend comparisons, if the null hypothesis can be rejected at
this level.


Formulas 
The formulas presented below are similar to those discussed by
Marascuilo and McSweeney (1967). The formulas used to calculate
the statistics H, $\chi_r^2$, and Q, for the Kruskal-Wallis, Friedman, and
Cochran tests, respectively, are those presented in statistics
textbooks commonly used in the behavioral sciences. In the case of the
Kruskal-Wallis and Friedman tests, these statistics are corrected
for tied ranks.


Contrasts 


The arbitrary comparisons, $\hat{\psi}$, of the average ranks (or condition
means) are calculated by the formula

$$\hat{\psi} = a_1 \bar{R}_1 + a_2 \bar{R}_2 + \cdots + a_K \bar{R}_K$$

where $a_1, a_2, \ldots, a_K$ are a set of coefficients of orthogonal polynomials
and $\bar{R}_1, \bar{R}_2, \ldots, \bar{R}_K$ are the average ranks (or condition means).


Variances 
The variances of the various comparisons are calculated by the
following formulas:

$$\mathrm{Var}(\hat{\psi}) = \left[\frac{N(N + 1)}{12} - \frac{\sum_{s=1}^{d}(t_s^3 - t_s)}{12(N - 1)}\right] \sum_{k=1}^{K} \frac{a_k^2}{n} \qquad \text{(Kruskal-Wallis Model)}$$

where N is the total number of observations (or ranks), d is the
number of sets of tied observations, t is the number of tied
observations for a given set s, K is the number of samples, n is the number of
subjects in each sample, and a is as defined above.

$$\mathrm{Var}(\hat{\psi}) = \left[\frac{K(K + 1)}{12} - \frac{\sum_{s=1}^{d}(t_s^3 - t_s)}{12\,n(K - 1)}\right] \frac{1}{n} \sum_{k=1}^{K} a_k^2 \qquad \text{(Friedman Model)}$$

where K is the number of experimental conditions, n is the number of
subjects tested under the K conditions, and d, t, and a are as defined
above.

$$\mathrm{Var}(\hat{\psi}) = \frac{K \sum_{i=1}^{n} S_i - \sum_{i=1}^{n} S_i^2}{n^2 K(K - 1)} \sum_{k=1}^{K} a_k^2 \qquad \text{(Cochran Model)}$$

where $S_i$ is the number of successes for subject i across the K
conditions, and n and a are as defined above.


Confidence Intervals 


The confidence intervals for the various comparisons are of the
following form:

$$\hat{\psi} - \sqrt{\chi^2_{K-1}(1 - \alpha)}\,\sqrt{\mathrm{Var}(\hat{\psi})} < \psi < \hat{\psi} + \sqrt{\chi^2_{K-1}(1 - \alpha)}\,\sqrt{\mathrm{Var}(\hat{\psi})}$$

where $\chi^2$ has K - 1 degrees of freedom, $\alpha$ = .05, and $\hat{\psi}$ and $\mathrm{Var}(\hat{\psi})$
are as defined above.
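
As a rough numerical illustration of the contrast, variance, and interval formulas, the Python sketch below may help. It is not the FORTRAN IV program described here: all function names are hypothetical, equal sample sizes are assumed as the program requires, the variances follow the Marascuilo and McSweeney forms as read from this section, and the chi-square critical value is supplied by the caller from a table (5.991 for 2 degrees of freedom at the .05 level).

```python
import math


def contrast(coeffs, mean_ranks):
    """psi-hat: weighted sum of average ranks (or condition means)."""
    return sum(a * r for a, r in zip(coeffs, mean_ranks))


def var_kruskal_wallis(coeffs, n, tie_sizes=()):
    """K samples of n subjects each; tie_sizes lists t for each tied set."""
    total = n * len(coeffs)
    ties = sum(t ** 3 - t for t in tie_sizes)
    bracket = total * (total + 1) / 12 - ties / (12 * (total - 1))
    return bracket * sum(a * a for a in coeffs) / n


def var_friedman(coeffs, n, tie_sizes=()):
    """n subjects ranked across K conditions."""
    k = len(coeffs)
    ties = sum(t ** 3 - t for t in tie_sizes)
    bracket = k * (k + 1) / 12 - ties / (12 * n * (k - 1))
    return bracket * sum(a * a for a in coeffs) / n


def var_cochran(coeffs, successes):
    """successes[i] is the number of successes for subject i across K conditions."""
    k, n = len(coeffs), len(successes)
    numerator = k * sum(successes) - sum(s * s for s in successes)
    return numerator / (n * n * k * (k - 1)) * sum(a * a for a in coeffs)


def confidence_interval(psi, variance, chi2_critical):
    """Post hoc interval: psi-hat plus or minus sqrt(chi-square * Var)."""
    half_width = math.sqrt(chi2_critical * variance)
    return psi - half_width, psi + half_width


# Linear trend over K = 3 samples of n = 5 with average ranks 4, 8, 12.
linear = (-1, 0, 1)
psi = contrast(linear, [4.0, 8.0, 12.0])
low, high = confidence_interval(psi, var_kruskal_wallis(linear, 5), 5.991)
```

In this made-up example the interval excludes zero, so the linear trend would be declared significant at the 5 per cent level.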


Input 
The job deck set-up for each analysis is as follows: 


Problem card 


Columns 1-2 = number of samples or experimental conditions (K)
        3-5 = number of subjects per sample or experimental
              condition (equal ns are required)
          6 = nonparametric test (1 = Kruskal-Wallis;
              2 = Friedman; 3 = Cochran)
          7 = trend analysis (1 = yes; 0 = no)
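
The fixed-column layout of the problem card can be read with simple string slicing. The following Python sketch is purely illustrative; `ProblemCard` and `parse_problem_card` are hypothetical names, and the original program reads the card with FORTRAN formatted input rather than anything like this.

```python
from dataclasses import dataclass

TESTS = {1: "Kruskal-Wallis", 2: "Friedman", 3: "Cochran"}


@dataclass
class ProblemCard:
    k: int        # columns 1-2: samples or experimental conditions
    n: int        # columns 3-5: subjects per sample or condition
    test: str     # column 6: nonparametric test
    trend: bool   # column 7: trend analysis requested


def parse_problem_card(card):
    """Slice an 80-column card image by the column positions listed above."""
    card = card.ljust(7)
    return ProblemCard(
        k=int(card[0:2]),
        n=int(card[2:5]),
        test=TESTS[int(card[5])],
        trend=card[6] == "1",
    )


print(parse_problem_card(" 3  511"))
# ProblemCard(k=3, n=5, test='Kruskal-Wallis', trend=True)
```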


Coefficient cards 


If the user chooses to have a post hoc trend analysis performed,
then these cards contain the linear, quadratic, and cubic (if K > 3)
coefficients of the orthogonal polynomials; otherwise, they are
omitted. Each set of coefficients must begin on a new card and must be
punched according to 26I3 format.


Format card 

This F-type variable format card indicates the location of the 
raw scores (or ranks) on the data cards. This format may be 
punched in any of the columns on the card. 


Data deck 


These cards contain the data for each sample (or experimental 
condition) and must be punched in accordance with the format 
specified on the F-type variable format card (see above). For the 
Kruskal-Wallis test, the data are punched by sample with the data 
for each sample beginning on a new card. For the Friedman or 
Cochran test, the data are punched by subject (or group of 
matched subjects) with the data for each subject (or group of 
matched subjects) beginning on a new card. 


Last card 


If the user wishes to terminate the program, then the card
immediately following the data deck must have the word FINISH
punched in columns 1 to 6. However, if the user wishes to analyze
another set of data, then this card is a blank card, and the job deck
is arranged sequentially (as described above) beginning with the
problem card.


Output 


The computer output for a given nonparametric test includes
(a) the value of the corresponding statistic, i.e., H, $\chi_r^2$, or Q, (b)
the number of degrees of freedom, and (c) the average rank (or
condition mean) for each sample (or experimental condition). In
addition, if the user chooses to have a post hoc trend analysis
performed, and the null hypothesis is rejected by the appropriate
nonparametric test, then the output includes the 95 per cent
confidence intervals for the linear, quadratic, and cubic (if K > 3)
comparisons.


Capabilities and Limitations 


The program, which is written in FORTRAN IV, can handle a
maximum of 80 samples (or experimental conditions) and 200
subjects per sample (or experimental condition). Jobs may be
run sequentially as described above.


Availability 


Copies of this paper and a source listing which includes input and
output data for sample problems can be obtained by writing to Dr.
James J. Roberge, Temple University, Department of Educational
Psychology, Philadelphia, Pennsylvania 19122.


A COMPUTER PROGRAM FOR TREND ANALYSIS
IN A TWO- OR THREE-FACTOR EXPERIMENT
WITH REPEATED MEASURES ON ONE OF THE
FACTORS¹


JAMES J. ROBERGE 
Temple University 


THE use of experimental designs with repeated measures on one of
the factors is extensive in the behavioral sciences. Moreover, the
repeated measures factor in these designs, e.g., delay of
reinforcement, length of isolation period, or dosage of a drug, often has levels
which represent equally spaced intervals along an underlying
continuum. Hence, additional information about the nature of the
relationship between the repeated measures factor (treatment levels)
and the dependent variable can be obtained by partitioning the
treatment variation into nonoverlapping trend components, i.e.,
linear, quadratic, cubic, or quartic, through the use of orthogonal
polynomials (Winer, 1962, pp. 353-869; Kirk, 1968, pp. 270-275).
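
The partition into trend components can be illustrated for the simplest case with a short Python sketch. It is a hedged illustration only: `trend_sum_of_squares` is a hypothetical name, the formula is the standard single-factor contrast sum of squares for equally spaced levels with equal numbers of observations, and the program's full partition of the interaction and error terms in a repeated measures design is not reproduced.

```python
def trend_sum_of_squares(level_means, coeffs, n_per_level):
    """Sum of squares for one trend component of the treatment variation.

    level_means  -- mean of the dependent variable at each treatment level
    coeffs       -- orthogonal polynomial coefficients for that component
    n_per_level  -- number of observations contributing to each mean
    """
    comparison = sum(c * m for c, m in zip(coeffs, level_means))
    return n_per_level * comparison ** 2 / sum(c * c for c in coeffs)


means = [1.0, 2.0, 3.0]
linear = trend_sum_of_squares(means, (-1, 0, 1), 4)      # 8.0
quadratic = trend_sum_of_squares(means, (1, -2, 1), 4)   # 0.0
```

With three levels the linear and quadratic components together exhaust the treatment sum of squares, which is 8.0 for these means; here all of it is linear.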

The program discussed in this paper performs an analysis of
variance for a two- or three-factor experiment with repeated
measures on one of the factors. More importantly, it provides the
researcher with the option of having a trend analysis performed on
repeated measures factors of the type described above.


¹The author gratefully acknowledges the support for this research which
was provided by a Faculty Research grant funded by Temple University.


Input
The job deck set-up for each analysis is as follows:
Problem card
Columns 1-2 = number of levels of factor A.
        3-4 = number of levels of factor B.
        5-6 = number of levels of factor C.
        7-9 = number of subjects per cell.
         10 = trend analysis (1 = yes; 0 = no).
Coefficient cards 


If the user opts to have a trend analysis performed on the repeated
measures factor, then these cards contain the linear, quadratic, and
cubic (if the number of levels of the repeated measures factor is
greater than 3) coefficients of the orthogonal polynomials;
otherwise, they are omitted. Each set of coefficients must begin on a new
card and must be punched according to 10I8 format.


Label cards 


These cards (one per factor) contain the alphanumeric labels
for factors A, B, and C (in a three-factor experiment), respectively.
The label for each factor may be punched in any of the columns on
the card.


Format card


This F-type variable format card indicates the location of the
raw scores on the data cards. This format may be punched in any
of the columns on the card.


Data deck

These cards contain the data for each subject and must be punched
in accordance with the F-type variable format card (see above).
Each card must contain the data for one subject. However, the
use of more than one data card per subject is permitted.


Last card 


If the user wishes to terminate the program, then the card
immediately following the data deck must have the word FINISH
punched in columns 1 to 6. However, if the user wishes to analyze
another set of data, then this card is a blank card and the job deck
is arranged sequentially (as described above) beginning with the
problem card.


Output


The computer output includes (a) the labels for the factors, (b)
an analysis of variance table, i.e., sources of variation, sums of
squares, degrees of freedom, mean squares, and F-ratios, which is
presented in a form similar to that used by Winer (1962), and (c)
matrices of means, standard deviations, and standard errors, for
all main effects and interactions. Furthermore, if the user chooses
to have a trend analysis performed on the repeated measures factor,
then the analysis of variance table includes the trend components
for the within subjects main effect, interaction(s), and error
variation.


Capabilities and Limitations 

The program, which is written in FORTRAN IV for processing
by computers in the IBM 360 (or the CDC 6000) series, can handle
a maximum of 10 levels for each of the factors and 10,000
observations. Well documented, it has variable names which are
mnemonic, and correspond to the symbols used in Winer's computational
formulas, to facilitate modification by users. Jobs may be run
sequentially by introducing a new set of control cards and a new data
deck as described above.


Availability 
Copies of this paper and a source listing which includes input and 
output data for sample problems for two- and three-factor experi- 
ments can be obtained by writing to Dr. James J. Roberge, Temple 
University, Department of Educational Psychology, Philadelphia, 
Pennsylvania 19122.


REFERENCES
Kirk, R. E. Experimental design: Procedures for the behavioral
    sciences. Belmont, California: Wadsworth, 1968.
Winer, B. J. Statistical principles in experimental design. New
    York: McGraw-Hill, 1962.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1971, 31, 283-28.


A STREAMLINED VERSION OF THE ALDOUS
SIMULATION OF PERSONALITY¹


ROBERT A. LEWIS
Iowa State University 


THE ALDOUS simulation of personality was developed by
Loehlin (1962) at the University of Nebraska and was later refined at
the University of Texas. The program was a weighted additive
model of impression formation where values of stimuli, perceptual
accuracy, and weighting factors could be manipulated (Loehlin,
1963). ALDOUS was basically a series of subroutines running in an
environment (the main program) where inputs to the program
represented stimuli from the environment. The model follows no
formal theory of personality, but has been involved in several types
of experiments (Loehlin, 1962, 1963, 1965, 1968).

The original program (Loehlin, 1965) consisted of approximately 
750 assembly instructions for the Burroughs 205. Inputs and outputs 
for this version were entirely numbers. ALDOUS was later elaborated 
and rewritten in FORTRAN. The FORTRAN version required 624 
source statements when made compatible with the IBM System/360 
Model 65 at Iowa State University. When compiled in FORTRAN 
G, it occupied 23,810 bytes of storage. Inputs to this original
FORTRAN version were numerical, and outputs were a series of
sentences representing the model’s reactions to the environment.


Streamlining ALDOUS 


The original FORTRAN ALDOUS consisted of a main program
(the environment), 10 subroutines, and four functions. Since all but
three of these subroutines were called only once by any other routine,
the author determined that a considerable savings of computer time
could be effected if most subroutines were inserted directly in the
main program in the appropriate places. A complete description of
each routine is available (Loehlin, 1963). The following routines
were collapsed into the main program: STIMUL, CONSEQ.
Routines RECOGN, EMOTN, ACTION, REACTN, LEARN and
REPORT were condensed into the ALDOUS subroutine. These changes
eliminated 10 sets of subroutine dimensioning statements without
changing the logic of the program. More than 28 per cent of the
statement numbers in ALDOUS were unused and were removed.
There were also several cases where variables were packed into one
array to pass from one routine to another and then unpacked upon
arrival in the new routine. This practice was eliminated.

¹Computer time for this research was provided by a grant from the Dean

After this initial streamlining, the output volume was reduced.
The original FORTRAN program used a series of numbers as input
stimuli. These were retained. But the output consisted of a series of
sentences constructed from concatenated phrases. In the modified
program, verbal output was replaced by the numerical subscripts for
the arrays which originally contained the phrases. This change
reduced the output for each experimental trial from five lines of
printed sentences to one line of digits which summarized the same
information. It is now possible to replicate the design experiments
in which hundreds of trials are run without generating an attendant
mountain of output. Additionally, the numerical output can be
directly analyzed statistically without further translation.


Testing the Revised Program 

In replications of Loehlin's well-known experiment of
development in two different environments (Loehlin, 1963), the modified
FORTRAN ALDOUS and the original FORTRAN ALDOUS
performed identically. The revised program, however, which is
approximately 1,000 bytes smaller than the original, has an average run
time of 31 per cent less. All experiments were conducted under a
multi-programming environment.

This revision of the original FORTRAN ALDOUS retains the
original input format, but uses a streamlined internal logic to
produce a more compact output which summarizes a process identical
to the original program. Copies of the flowchart and listings of the


A COMPUTER PROGRAM FOR ESTIMATING THE
POWER OF TESTS OF ASSUMPTIONS OF
MARKOV CHAINS


ROBERT W. LISSITZ 
University of Georgia¹
SILAS HALPERIN 
Syracuse University 


MARKOV processes are a popular model of longitudinal behavior
(Gribbons, Halperin, and Lohnes, 1966; Lu, 1966; Atkinson, 1964;
and Rapoport, 1969). They suggest themselves whenever a subject
or sample of subjects is observed repeatedly across time and
behavior is recorded as discrete categories. The Markov model is a
very powerful one, allowing the user a great deal of parsimony as
well as providing explicit predictions. The references given above
have demonstrated the usefulness of this model for students of
human behavior. Those readers looking for a well specified,
dynamic, mathematical model should find their work of great interest.
One of the most important advances in the study of individual
differences has been the development of mathematical models and
their appropriate and efficient utilization by the less mathematical
but more empirical researcher. This model involves calculation of
an initial vector of probabilities and a matrix of transition
probabilities. Detailed descriptions of these parameters can be found in
Kemeny and Snell (1960), and Anderson and Goodman (1957).

Certain testable assumptions must be made, though, before the
above parameters can be calculated. These are assumptions
regarding the order and stationarity of the stochastic process. Anderson
and Goodman (1957) discussed these assumptions and presented
two test statistics and a body of sampling theory for evaluating
their truth or falsity. One test is based on the likelihood ratio
statistic and the other on a contingency statistic. In both cases these
tests assume large sample size. Halperin (1966) studied the case
where the null hypothesis is true and the test statistics limit (as
sample size approaches infinity) to the chi-square distribution. The
computer program described here concerns itself with the case where
the null hypothesis is false and the experimenter is interested in the
power of the statistical tests. In this case the test statistics limit
(again, as a function of sample size) to the non-central chi-square
distribution (Patnaik, 1949).

¹The preparation of this manuscript was supported, in pa...
senior author was in residence, during this time, at the Psychometric Labora-

Very little is known, from a mathematical standpoint, about the 
power of these tests. Lissitz (1969) examined some characteristics 
of power using the methodology of the Monte-Carlo procedure. The 
computer program reported here grew out of this work. 

This program allows a researcher to plan his study with regard to 
the power of these test statistics. He may use this program to select 
the number of states, the number of stages, and the sample size to 
obtain whatever level of power he desires. His problem would be 
simplified considerably if he could be certain of obtaining statistics 
which conform to the non-central chi-square distribution. For many 
research problems with small sample size this is not possible. Since 
no analytic solution has been derived for the sampling distribution 
of these statistics for small sample size, the researcher must depend 
upon Monte-Carlo procedures. This computer program performs 
the Monte-Carlo method and obtains an empirical solution. From 
this empirical sampling distribution the researcher is able to calculate 
the power of his tests and thus plan his research more carefully. 


Input 
The actual form of the card input is well specified by “comment”
cards at the beginning of the program. Control of the program is by
a series of cards prepared by the user. These cards specify the actual
parameters of the population model (order I-stationary, order
II-stationary, or order I-nonstationary) of interest and the critical
values corresponding to the .10, .05, and .01 alpha levels.


Data Generation 
The program uses the parameters of the hypothetical population
to generate samples of data from which the test statistics can be
calculated. The following is the procedure for generating these
random data vectors from an order one stationary Markov chain.
Assume that there exists a chain whose initial probabilities are
specified by the experimenter. It is desired to select a set of vectors
from the totality of all possible vectors in this chain. Represent the
m element (where m is the number of states in the model) initial
vector of probabilities as a partition of a line interval of unit length.
We can generate, by the power residue method, a random number
between zero and one from a uniform distribution. This number
must fall into one of the segments of the partitioned unit length
interval. The segment into which this uniform random number falls
will determine that person's starting state, or his position at time
zero. In the limit, the proportion of random numbers which fall into
segment i, and are consequently classified into starting state i, will
be equal to the probability of starting in state i as given by the ith
element of the initial probability vector.

Once an initial element i of the data vector is randomly generated,
we are in a position to use the ith row of the transition matrix to
generate the next element of the data vector. Using the ith row in a
manner identical to that of the initial vector we can generate a
second element of the data vector which is consistent with the
probabilities set down in that row. Again, in the limit, the proportion of
people who start in state i and whose next state is generated to be j
will be equal to the corresponding probability specified in the
transition matrix. If we continue this procedure until the vector has
T + 1 (where T is the number of transitions) elements, we will be able to
interpret this datum vector as being randomly sampled from the
specified Markov chain.
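
The procedure just described can be sketched in a few lines of Python. This is an illustration only, not the FORTRAN IV program: it substitutes Python's built-in generator for the power residue method, and the names draw_state and sample_chain, together with the example probabilities, are hypothetical.

```python
import random
from itertools import accumulate


def draw_state(probs, rng):
    """Locate a uniform draw within the unit interval partitioned by probs."""
    u = rng.random()  # uniform on [0, 1); stands in for the power residue method
    for state, upper in enumerate(accumulate(probs)):
        if u < upper:
            return state
    return len(probs) - 1  # guard against round-off in the final boundary


def sample_chain(initial, transition, n_transitions, rng):
    """Return one data vector of n_transitions + 1 states."""
    state = draw_state(initial, rng)
    vector = [state]
    for _ in range(n_transitions):
        # The row of the current state plays the role of the initial vector.
        state = draw_state(transition[state], rng)
        vector.append(state)
    return vector


# Hypothetical three-state population model (values chosen for illustration).
initial = [0.5, 0.3, 0.2]
transition = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]
rng = random.Random(1970)
sample = [sample_chain(initial, transition, 4, rng) for _ in range(1000)]
```

Each element of sample is one subject's data vector with T + 1 = 5 entries; the proportion of vectors starting in state 0 should approach 0.5 as the number of vectors grows.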

A similar procedure is used for generating the data vectors in the
case of stationary order two Markov processes and for order one
nonstationary Markov processes. The only meaningful difference
between the case of generating data vectors from the stationary
order two population and the case discussed above is in the procedure
used to initialize the process. In this situation there are two initial
states to be determined before the transition matrix can begin
prescribing probabilities. The program was written to input one initial
vector and then to use it twice to generate two initial states. A
random number is generated, and it is compared to the intervals
of the initial probability vector. Thus, this process defines a state in
the manner outlined above. A second number is then generated,
and it, also, is compared to this vector, and thus the second initial
state is given. Together, these states prescribe the particular row
and layer of the order two transition matrix.
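
A minimal sketch of the order two initialization follows. It assumes that the transition probabilities are stored as transition[i][j], the row of next-state probabilities given the two preceding states i and j; that layout, the function names, and the length of the returned vector are assumptions made for illustration, not details of the original program.

```python
import random
from itertools import accumulate


def draw_state(probs, rng):
    """Locate a uniform draw within the unit interval partitioned by probs."""
    u = rng.random()
    for state, upper in enumerate(accumulate(probs)):
        if u < upper:
            return state
    return len(probs) - 1  # guard against round-off in the final boundary


def sample_order_two(initial, transition, n_transitions, rng):
    """Return a vector of two initial states followed by n_transitions states."""
    # The single initial vector is used twice, once for each initial state.
    vector = [draw_state(initial, rng), draw_state(initial, rng)]
    for _ in range(n_transitions):
        # The two most recent states select the layer and the row.
        row = transition[vector[-2]][vector[-1]]
        vector.append(draw_state(row, rng))
    return vector


# Hypothetical two-state model in which the next state depends on both predecessors.
initial = [0.6, 0.4]
transition = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.4, 0.6], [0.2, 0.8]],
]
vector = sample_order_two(initial, transition, 5, random.Random(7))
```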

The method of generating random data vectors for the nonstationary
order one population is identical to that for the stationary
order one situation except for the choice of the transition matrix.
Instead of there being but one matrix there are T matrices and
each of these is used in its turn. Again, the final result is a set of
data vectors which conform, in the limit, to the population
parameters.
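
The nonstationary case changes only which matrix supplies the row at each step, as the following sketch suggests. The function names and the example matrices are hypothetical; one matrix is supplied for each of the T transitions.

```python
import random
from itertools import accumulate


def draw_state(probs, rng):
    """Locate a uniform draw within the unit interval partitioned by probs."""
    u = rng.random()
    for state, upper in enumerate(accumulate(probs)):
        if u < upper:
            return state
    return len(probs) - 1  # guard against round-off in the final boundary


def sample_nonstationary(initial, matrices, rng):
    """Return a vector of len(matrices) + 1 states, using each matrix in turn."""
    state = draw_state(initial, rng)
    vector = [state]
    for matrix in matrices:  # one transition matrix per transition
        state = draw_state(matrix[state], rng)
        vector.append(state)
    return vector


# Hypothetical two-state model with T = 2 transitions.
matrices = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.1, 0.9]],
]
vector = sample_nonstationary([0.5, 0.5], matrices, random.Random(3))
```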

Once a sample of data vectors is generated, the likelihood ratio 
statistic and the contingency statistic are calculated. This procedure 
is followed repeatedly, until an empirical sampling distribution is 
generated. The size of the sample and the number of samples are
parameters specified by the program user.

In addition to the empirical values, a theoretical distribution is
calculated, assuming that the limiting non-central chi-square
distribution is appropriate. This theoretical distribution is a two
parameter one (degrees of freedom and non-centrality value).
The degrees of freedom are given by Anderson and Goodman (1957);
and the noncentrality value, from the same article and from Lissitz
(1969). The reader is referred to these sources for further discussion
of this subject. Other methods for estimating the noncentrality value 
are possible but are not considered in this program. 


Output 

The output of this computer program is contained in two tables.
One summarizes the distributions resulting from the Monte-Carlo
sampling procedure and the theoretical noncentral chi-square
subroutine. These are reported in the form of a grouped relative
frequency table giving the proportions within each of the 14 intervals
and the cumulative proportions. This is done, of course, for each of
the two test statistics. Two summary statistics are provided: mean
absolute difference between the theoretical and the empirical
proportions, and the number of intervals with an empirical proportion
outside the 95 per cent confidence limits of the theoretical
proportion.


The second table in the output contains the power for the specific
Type I error rates specified on the last set of input cards. These
error rates are presented for the two empirically determined
distributions only. The first table allows (within the accuracy of
interpolation) for calculation of power under the theoretical
distribution, if this calculation is of interest.
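
Empirical power of this kind is simply the proportion of simulated test statistics that exceed the critical value for a given alpha level. A minimal sketch, assuming rejection when the statistic is strictly greater than the critical value; the function name and the numbers shown are hypothetical and are not output of the program.

```python
def empirical_power(simulated_statistics, critical_value):
    """Proportion of Monte-Carlo statistics that lead to rejection."""
    rejections = sum(1 for s in simulated_statistics if s > critical_value)
    return rejections / len(simulated_statistics)


# Hypothetical statistics from eight Monte-Carlo samples, and hypothetical
# critical values of the kind a user would supply on the input cards.
simulated = [3.1, 9.8, 12.4, 6.7, 15.2, 8.9, 11.0, 4.4]
critical_values = {0.10: 7.78, 0.05: 9.49, 0.01: 13.28}
power_by_alpha = {
    alpha: empirical_power(simulated, c) for alpha, c in critical_values.items()
}
```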


Program Availability 


The computer program is written in FORTRAN IV and is currently
operating on an IBM 360 model 50. The program calls for two
subroutines from the Scientific Subroutines package (IBM, 1968).
A listing of the program is available upon request from R. W. Lissitz.


SUBROUTINE TO DECODE IBM 1230 DATA


JOHN W. MENNE 
Iowa State University 
JOHN E. KLINGENSMITH 
Arizona State University 


THE FORTRAN IV/360 decoding procedure uses the machine
bit form of data read in alphanumeric format to calculate in binary
arithmetic an index which is used to address the stored vector of
the decoded values. Decoding in this way seems to improve over
routines such as described by Veldman (1967, pp. 167-169) by about a
factor of five. The subroutine is limited to 16/32 bit machines, but
the procedure is adaptable to other machines. Other features are:

1. Dimensions are supplied by the calling program, as in IBM
supplied subroutines.

2. The same storage vector is used for the alphanumeric input and
the decoded (integer*2 or integer*4) output. Of course, on input
this vector is only partially filled with coded data.

3. Valid characters which are not part of the IBM 1230 code will 
not be returned to the calling program but will be printed with 
identification. 

4. The routine requires 922 bytes. 

Source deck and listing, including usual calling statements, are
available from the Student Counseling Service at Iowa State
University.


COMPUTER PROGRAMS FOR RANK ANALYSIS OF
COVARIANCE


EDWARD P. LABINOWICH 
San Fernando Valley State College 


JAMES K. BREWER 
Florida State University 


IN many experimental studies the nature of the sample dictates
the adoption of nonparametric statistical techniques, since these
are based on less stringent assumptions and are often more
generalizable than parametric techniques. FORTRAN IV programs are
provided in this paper for the non-parametric one-way analysis of


In Quade’s original paper, various methods were discussed for
the comparison of two or more samples with respect to a response
variable Y in the presence of a concomitant variable (covariate)
X, a situation for which the usual analysis method is a standard
one-way analysis of covariance. Two distinct methods have been
developed by Quade: one for the analysis of covariance by ranks for
one covariate and another for the case of two covariates. A separate
FORTRAN IV program has been written for each technique.
Despite a distinct contrast in the specific details for the determination
of the variance ratio by each method, the input and output for the
computer programs are basically the same.


Input
Each job deck contains control cards which record the total
sample size and the size of each group. The data decks are punched
according to a format card and each data card provides the
following information for each subject: the Y score, the Y rank, the X
score and the X rank (X1 and X2 when there are two covariates).


Output
The computer output includes for each subject: the Y rank,
the X ranks corrected for the median, the predicted Y rank, and the
residual from the regression of Y ranks. Values for the variables at
each stage of computing the final variance ratio, e.g., the sum of
squares, are included in the output.


Capabilities and Limitations
The programs are documented and employ variable names which
correspond mnemonically to the symbols assigned by Quade in his
computational formulas. Both programs are applicable only to
sample problems as illustrated in Quade’s examples.


Availability
A print-out and sample output of each program can be obtained
by writing Dr. E. Labinowich at the School of Education, San
Fernando Valley State College, Northridge, California 91324, or
James K. Brewer at the Department of Educational Research and
Testing, College of Education, Florida State University,
Tallahassee, Florida 32306.


Frederick G. Brown. Principles of Educational and Psychological
Testing. Hinsdale, Ill.: The Dryden Press, 1970. Pp. vii + 468.
$9.95.


The content of Brown’s book on educational and psychological
testing is summarized fairly well by the chapter headings: In order
of appearance are chapters discussing the nature of measurement,
test development, reliability, validity, scores and norms, how to
combine scores, measurement in the domains of achievement, aptitude and
personality, and, finally, problems and trends in testing. An
account of descriptive statistics is given in an appendix to the first
chapter. The book seems best suited to training test users, not test
developers. This judgment is supported by the fact that relatively
little space (approximately 20 pages) is devoted to item writing and
item analysis whereas a relatively large amount of space
(approximately 165 pages) is used to describe different types of standardized
instruments and how they may be evaluated. As advertised in the
preface, Brown has successfully avoided the temptation to include
tedious catalogues of standardized instruments. Instead, different
types of standardized measures are illustrated through reference to
familiar and widely-used tests and questionnaires. An excellent guide
for evaluating tests is presented in outline form in Chapter 9. It
will be of undoubted assistance to those readers who have the task
of selecting a standardized instrument for a particular purpose.

The book has several other attractive features. It is well written
in an informal style that should appeal to students. It includes
important topics not ordinarily found in introductory texts. Two
examples of such topics are (1) assessing the accuracy of selection
decisions that are made on the basis of how examinees score on a
predictor test, and (2) discriminant analysis. These, as well as more
familiar topics such as factor analysis and multiple-regression, are
developed with a view to having the reader understand the purpose
and rationale of the technique. Numerical examples, standing
separately from the text, are provided to illustrate each statistical
technique. But mathematical derivations are omitted to meet
Brown's stated assumption that not all readers will have done
courses in statistics. Also worthy of note is the study guide for the
book. It contains previews of each chapter and questions that will
help focus the student’s attention as he reads.

The book’s positive points have been given first emphasis in order
to create a suitable perspective from which to view the negative
comments that follow. These are offered to alert potential users of
the book to difficulties that can be surmounted through class lectures
and discussions. Very occasionally Brown lapses into the use of
questionable terminology and imprecise language. Reliability is
defined parenthetically as “degree of inconsistency” (pp. 52-3) but
elsewhere as degree of consistency. Brown interprets standard error
of measurement (p. 85) and standard error of estimate (p. 115) as
standard deviations of a normal distribution without explicitly
noting the assumption underlying this interpretation. And on two
occasions (p. 113 and p. 192) the predicted score that results from
the application of a linear regression equation in one predictor is
described as the mean of the criterion scores made by persons with
the same predictor score. Brown should also have stated that this
description provides only a convenient approximation to reality.
In at least one case Brown can be faulted for failing to develop
a concept fully. The notion of optimum test discrimination is treated
as follows: “For tests designated to discriminate between students,
a mean score slightly higher than 50 per cent of the maximum
possible score is optimal (with an approximately normal distribution)”
(p. 274). This assertion may represent Professor Brown’s experience
with what is possible in practice. Ideally, a rectangular distribution
of scores would be desired if the situation required optimum
discrimination among all the students tested. Rectangular
distributions are, of course, rarely, if ever, observed. But tests can probably
be constructed to yield distributions of scores that are more
platykurtic than the normal distribution. Such tests would provide
relatively better overall discrimination among students than the tests of
which Brown speaks.

Perhaps more serious are a contradiction and a bit of
misinformation that appear in the book. The contradiction occurs with respect
to the notions of interval measurement and standard score scales.
Brown gives the impression (p. 9, p. 163 and pp. 169-170) that by
transforming raw scores to standard scores, measurement on the
level of an interval scale may be achieved. He does so after having
raised the question (pp. 7-8) of whether interval measurement is
possible for most educational and psychological variables. In
addition, he fails to acknowledge that if standard scores are on an
interval scale then so are the corresponding raw scores inasmuch as
they are merely linear transformations of the standard scores.

The bit of misinformation is conveyed in the assertion
that factor analysis can be used to determine whether a test is
unifactorial. Brown glosses over two problems factor analysts face: (1)
deciding what type of correlation coefficient to compute when
items are dichotomously scored, and (2) determining the rank of
the matrix being factored. Brown fails to acknowledge that if these
problems are solved and it is found that a matrix has a rank of one,
then a necessary but not sufficient condition of unidimensionality
has been met (Lord and Novick, 1968, p. 382).

An attempt could be made to excuse many, if not all, of the fore- 
going complaints with the explanation that to overcome them would 
require the use of more sophisticated mathematics and more involved 
verbal explanations than is warranted in an introductory text. It is 
clear that we would argue this point. But potential users will have 
to make up their own minds about how intellectually honest an 
introductory text can be and still be written understandably. 

On two other occasions in the book, Brown made what we feel
are arguable decisions. One involves the treatment of reliability.
The Kuder-Richardson coefficients are presented as indices of
homogeneity, not of reliability. Homogeneity is a concept with a long
and confused history. Brown tries to avoid most of the problems
associated with it by ignoring the work of Louis Guttman, among
others, and by defining homogeneity as “. . . consistency of
performance over all items on a test” (p. 77). But this definition fails
to clarify the concept for the reader. In particular, we were left
wondering whether homogeneity by this definition could mean that
all the items of a test have linearly related true scores of the special
type referred to by Lord and Novick as “essentially tau-equivalent”
(Lord and Novick, 1968, p. 50). If it does, then the KR-20 estimate
of reliability for a set of items that satisfy the definition equals the
proportion of true score variance in the variance of observed
scores, and this is how Brown defines reliability. This relationship
between reliability and one conception of homogeneity provides
an exception to the assertion “. . . that reliability will always be
greater than homogeneity” (p. 81). Another, possibly unfortunate,
aspect of this approach to the treatment of reliability is that it leads
Brown to interpret split-half coefficients as estimates of equivalence
when in fact they may also be regarded as estimates of internal
consistency. This follows from the well-known fact that the mean
of all possible split-half coefficients for a test, each computed
according to what Brown refers to as Guttman’s formula (p. 67), is
equal to the value of KR-20 for the test.
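
The equality can be checked numerically. The sketch below assumes that the formula in question is the split-half coefficient 2[1 - (var(A) + var(B)) / var(X)], where A and B are the half-test scores and X the total score, and that "all possible" splits means all divisions of an even number of items into two halves of equal length. The function names and the item scores are hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def kr20(scores):
    """KR-20 for a list of examinees, each a list of 0/1 item scores."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    # For a 0/1 item the population variance equals p * q.
    item_variance = sum(pvariance([row[i] for row in scores]) for i in range(k))
    return k / (k - 1) * (1 - item_variance / pvariance(totals))


def split_half(scores, half_a):
    """Split-half coefficient 2 * (1 - (var(A) + var(B)) / var(X))."""
    k = len(scores[0])
    half_b = [i for i in range(k) if i not in half_a]
    a = [sum(row[i] for i in half_a) for row in scores]
    b = [sum(row[i] for i in half_b) for row in scores]
    totals = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pvariance(a) + pvariance(b)) / pvariance(totals))


def mean_split_half(scores):
    """Mean coefficient over every split into two equal halves."""
    k = len(scores[0])
    # Keeping item 0 in the first half counts each split exactly once.
    splits = [(0,) + rest for rest in combinations(range(1, k), k // 2 - 1)]
    coefficients = [split_half(scores, s) for s in splits]
    return sum(coefficients) / len(coefficients)


# Hypothetical responses of eight examinees to four dichotomous items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
difference = kr20(scores) - mean_split_half(scores)
```

For these data the two quantities agree to within floating-point error, whichever items happen to be paired in a particular split.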

The other decision about which Brown and we disagree concerns the
use of the correction-for-guessing formula. He concludes with
respect to the use of the formula “. . . that the burden of proof [should]
fall on the proponents of correcting for guessing” (p. 273). We
agree that this would be the case where the instructions to the
student do not specify what he is to do when he encounters a
question that cannot be answered with confidence. But this is bad
practice. The instructions should tell the student either to guess or
not, and in the latter case, it is surely necessary to motivate him not
to guess by informing him of a penalty for wrong answers or a
reward for omitted questions. When reward or penalty instructions
and the corresponding correction-for-guessing formula are used,
the resulting test scores can be expected to be more reliable and
more valid than if the student is encouraged to guess as Brown
recommends.

On the basis of the strengths and weaknesses we identified in the
book, we conclude that Brown has, in fact, produced a book that
is very good in many respects. It stands as a testimonial to Brown's
breadth of knowledge and eclecticism. Many instructors will
undoubtedly find Principles of Educational and Psychological Testing
useful for their courses.

Having offered this conclusion, it is possible for us to deal with
the broader issue of the contribution the book makes to the field
of educational and psychological measurement. Here we also
comment on the state of the field as it now exists. Brown's book is one
of many to appear in the last few years, each providing a
restatement of the discipline in fairly traditional terms. This leads us to
ask whether these books were all necessary. Are “new” books being
written more to make singular contributions to the field or to satisfy
the desire of publishers for fairly traditional, eclectic, and therefore
salable products?

The American Scientist recently carried an article by Paul
Weiss (1970) entitled Whither life science? One of the observations
Weiss made in his article concerns the textbooks that existed in
biology when he started working in that field some fifty years ago.
Then, Weiss says, “Textbooks were few, comprehensive, original,
unique, almost everyone of them bearing the signature of a master,
but obviously there were wide gaps between the areas they covered.
Yet they had one important feature in common: they tried, some
more than others, to balance overindulgence in their particular
speciality by pointing up the place and context of that speciality within
the continuum of the living world. In this way we became aware of
both the fundamental interconnectedness of all aspects of life and
the appalling dearth of concrete knowledge about the
interconnections, provisionally labelled by symbolic terms” (p. 158). Although
Weiss does not say it, we sensed that he believes most present-day
textbooks do not present biological knowledge in the unique and
original way of older textbooks. More important is his implication
that modern textbooks fail to define the problems or describe the
goals of biology.

It can be argued, we think, that much the same situation exists in
the field of educational and psychological measurement. Among
recent new textbooks, one looks in vain for highly original
contributions. Many new textbooks, Brown's included, are very
comprehensive in their coverage of measurement topics, but most teachers
of educational and psychological measurement will know of well
established books that are at least as comprehensive. And no recent
book we have seen would really satisfy Weiss in that it adequately
states the problems and identifies the goals of educational and
psychological measurement as it is practiced today. Perhaps what is
needed to advance the field is for fewer people to spend their time
writing textbooks and for more to devote their time to research
and to the development of new measurement concepts and
techniques. Then, in time, truly “new” textbooks could be written.


Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences.
New York: Academic Press, 1969. Pp. xv + 415. $13.50.


The power of a statistical test, according to the classical theory
of hypothesis testing advanced by Neyman and Pearson, is the
probability of rejecting a null hypothesis in favor of an alternative
hypothesis when the alternative one is true. Three factors affect
power: the probability of rejecting a null hypothesis when it is
true (hereafter symbolized by α), sample size, and the standardized
magnitude of the difference between the parameter value specified
in a null hypothesis and its true value (hereafter referred to as
“effect size”). Power increases as α or sample size or effect size
increases. The advantage to a scientist who knows all this is that, in
theory at any rate, he can plan his experiments so as to achieve a
satisfactory degree of power. Now, the main factor subject to
manipulation in planning experiments is sample size. Cohen's book
is designed primarily to help a researcher with foresight determine
the number of subjects he needs to achieve a desired degree of
power.
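
The dependence of power on these three factors can be illustrated with the simplest possible case, a one-sample z test with known standard deviation. This is a sketch under that assumption and is not one of the tests Cohen tabulates; the function name z_test_power and its parameters are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha, directional=True):
    """Power of a one-sample z test for a standardized mean difference."""
    z = NormalDist()
    shift = effect_size * n ** 0.5  # mean of the test statistic under the alternative
    if directional:
        return 1 - z.cdf(z.inv_cdf(1 - alpha) - shift)
    critical = z.inv_cdf(1 - alpha / 2)
    # Rejection can occur in either tail for a nondirectional test.
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)


baseline = z_test_power(0.5, 20, 0.05)
larger_sample = z_test_power(0.5, 40, 0.05)
larger_effect = z_test_power(0.8, 20, 0.05)
larger_alpha = z_test_power(0.5, 20, 0.10)
```

Each of the last three values exceeds the baseline, and when the effect size is zero the function returns α itself.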

Cohen begins the book with a chapter explaining the concepts
of statistical power and discussing several of the problems that had
to be solved in the process of preparing the book. More about
these later. The next seven chapters, comprising over [...] per cent
of the book, contain descriptions of statistical tests and tables of
[...] the following topics: [...] difference between product-moment
correlation coefficients, [...] that a proportion is different from 0.5,
the test of the significance of a difference between two proportions,
[...] the F test as employed in fixed-model analyses of variance and
covariance. A final chapter describes the computational procedures
used in generating the results.

Approximately 180 pages of the book are devoted to power and
sample size tables. (In addition, there are about five pages of
miscellaneous tables.) For specified levels of α, the power tables contain
estimates of power for combinations of selected sample sizes and
selected effect sizes. Each table also indicates how large an observed
effect would have to be for it to be statistically significant, given
a specified α-level and sample size. Consequently, the power tables
can be used as an aid in significance testing. The sample-size tables,
on the other hand, indicate how large the sample must be, again
for a particular α-level, in order that a specified degree of power
will be achieved with respect to a specified effect size.

In preparing the tables, Cohen considered three α-levels: 0.01,
0.05, 0.10. Consequently, three power and sample size tables are
reported for each variation of each statistical test that was studied.
Two factors determined which variations would be included in the
analysis: (1) For some tests, nondirectional (two-tailed), as well
as directional (one-tailed), alternative hypotheses could be of
interest. In such cases, separate sets of three power or sample-size
tables are provided for each type of alternative hypothesis. This
means that the amount of information available about such tests is
increased because, to a reasonably close approximation, the tabled
power (sample-size) values for nondirectional alternative hypotheses
may be regarded as power (sample-size) values of directional
alternative hypotheses at one-half the reported level of α. On the other
hand, the tabled power (sample-size) values for directional
alternative hypotheses may be regarded as power (sample-size) values of
nondirectional tests at double the reported level of α. (2) For the
chi-square test, sample size is independent of the number of degrees
of freedom for the test. A similar independence exists in the F test
of fixed-model analysis of variance and covariance; the independence
is between the number of degrees of freedom for the lesser
(expected) mean square (which is dependent on sample size) and the
number of degrees of freedom for the greater (expected) mean
square (which is not dependent on sample size). In both these
cases, sets of power and sample-size tables are provided for
different numbers of (sample-size-independent) degrees of freedom.
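
The approximation relating nondirectional and directional tests can be illustrated for a one-sample z test with known standard deviation. That test is an assumption made here for simplicity and is not one of Cohen's tabled cases; the function names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def power_one_tailed(effect_size, n, alpha):
    """Power of a directional z test in the predicted direction."""
    shift = effect_size * n ** 0.5
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)


def power_two_tailed(effect_size, n, alpha):
    """Power of a nondirectional z test, counting rejections in both tails."""
    shift = effect_size * n ** 0.5
    critical = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(critical - shift)) + Z.cdf(-critical - shift)


two_tailed_at_05 = power_two_tailed(0.5, 30, 0.05)
one_tailed_at_025 = power_one_tailed(0.5, 30, 0.025)
discrepancy = two_tailed_at_05 - one_tailed_at_025
```

The discrepancy is the probability of rejecting in the tail opposite to the true effect, which is why the approximation is close whenever the effect is not very small.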

As indicated previously, Cohen had to solve several problems in
performing the reported power analyses. One problem was to define
a metric-free index of effect size for each statistical test, an index
that would not reflect the type of data accumulated in a particular
study. Where possible, Cohen chose effect-size indices that were
related through the concept of “. . . proportion of variance accounted
for in the dependent variable” (p. 12). This enabled him to solve
another problem, that of providing a rough basis for judging the
magnitude of an effect size. A small effect is defined as one that
accounts for approximately one per cent of the variance in the
dependent variable, a medium-sized effect for nine per cent, and a
large effect for 25 per cent. Another problem was to illustrate how
the results of the power analyses can be used. Cohen handled this
problem by considering different “cases” involving a particular test
and providing one or more illustrative examples of each case. For
example, in the chapter on the t test for means, five cases are
considered: the difference between the means of two samples drawn
[...]
mean (case 3), and the difference between two means when the
observations yielding the first mean are correlated with the
observations yielding the second (case 4). The power and sample size tables
for the chapter on the t test were constructed using the
assumptions of case 0. Therefore, the tables do not necessarily apply to
cases 1-4. Cohen indicates when they may reasonably be applied to
cases 1-4, shows how to use them in those cases, and provides an
idea of how large the discrepancies may be between the tabled
results and the true results.

Statistical Power Analysis for the Behavioral Sciences could appeal
to a wide range of behavioral scientists. It may be read by anyone
who has mastered the statistics found in introductory textbooks on
the subject. Cohen does not trouble the reader with detailed
mathematical derivations. The writing contains intentional
redundancy to enable the reader to test his understanding of concepts
as he progresses. And the many examples contained in the book
promote understanding and help the reader to use the results.

Despite these positive characteristics, it is doubtful that the book
will be widely used. The reason for such a pessimistic prediction
is that power analysis is usually very difficult to do in the planning
stage of experiments because investigators do not often have a good
idea of the effect size that may be expected. When effect size is
unknown, it is impossible to determine the sample size needed to
achieve a given power level. Of course the problem may be
attacked in a different and possibly more useful way (Glass and
Stanley, 1970, pp. 287-288). In this approach, the investigator first
determines the size of the largest sample he can afford to take.
Then the problem is to determine whether this sample size is large
enough to provide a satisfactory degree of power in the event that
the effect size is as small as it could possibly be and still be
interesting. Such an approach may reveal that a smaller sized sample would
maintain a satisfactory power level or it may indicate that the
investigator's resources are so small that he cannot reasonably expect
to detect an effect even if it is reasonably large. This type of
thinking can be assisted by the information contained in Statistical Power
Analysis.

Another reason why the book may not be used extensively is
that it is not as comprehensive as it might have been. Many
behavioral scientists employ nonparametric statistics in their research. For
them, the utility of the book would have been enhanced had analyses
been made of tests involving rank-order correlation coefficients or
the Mann-Whitney U test, to cite two examples.

There are other aspects to the book that may annoy the reader.
One thing that disturbed me is that Cohen failed to support many
of his statements with appropriate references to the literature. For
example, in his discussion of the test of the significance of a
product-moment correlation coefficient, Cohen makes the following
assertion: “However, when significance tests [of the product moment
correlation coefficient] come to be employed, assumptions of
normality and homoscedasticity are formally invoked. Despite this, it
should be noted that, as in the case of the t test with means,
moderate assumption failure here, particularly with large n, will not
seriously affect the validity of significance tests, nor of the power
associated with them” (p. 72). No reference is given in support of
the assertion. Why is the scholarship of writers of statistics books
not held accountable in the same way that the scholarship of other
scientists is? Another, relatively small point, that may disturb the
statistical sophisticate, is Cohen’s use of bold face Latin letters to
symbolize both population parameters and sample statistics. The
symbol for the sample statistic differs only in having the letter s
as a subscript.
Finally, mention should be made of the fact that Statistical Power
Analysis contains its share of errors and ambiguities. The range of
these faults, both in terms of type and degree of seriousness, has
been documented by McNemar (1970). My favorite is the state-
ment that a nondirectional test has less power than a directional
test “. . . provided that the sample result is in the direction predicted.
Since directional tests cannot, by definition, lead to rejecting the
null hypothesis in the direction opposite to that predicted, these
tests have almost no power to detect such effects” (p. 5; essentially
repeated on p. 39).
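
The relation between the power of directional and nondirectional tests can be made concrete with a small numerical sketch. It is not taken from Cohen's book or from McNemar's review: it assumes a one-sample z test with known unit variance, and the names `power_directional`, `power_nondirectional`, and `delta` (the true mean shift in standard-error units) are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power_directional(delta, alpha=0.05):
    """Power of an upper-tailed z test when the true shift is delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - delta)


def power_nondirectional(delta, alpha=0.05):
    """Power of a two-tailed z test when the true shift is delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = 1 - STD_NORMAL.cdf(z_crit - delta)   # rejection in the upper tail
    lower = STD_NORMAL.cdf(-z_crit - delta)      # rejection in the lower tail
    return upper + lower


if __name__ == "__main__":
    # delta > 0 is the predicted direction; delta < 0 is the opposite one.
    for delta in (2.0, -2.0):
        print(delta,
              round(power_directional(delta), 4),
              round(power_nondirectional(delta), 4))
```

Under these assumptions the directional test has the larger power when the true effect lies in the predicted direction, and a rejection probability below its own alpha level when the effect lies in the opposite direction.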
Educational and psychological researchers who attempt to use
the logic of power analysis in planning their experiments and who
employ the tests discussed in Statistical Power Analysis will want to
add the book to their reference collections. They are particularly
encouraged to do so if they find tables easier to use than nomo-
graphs. Beyond that, Statistical Power Analysis may have some
utility as a supplementary reference for advanced courses in educa- 
tional and psychological statistics courses. But teachers are well 
advised to read both the book and McNemar’s review with care
before making class assignments. 
Stanley Cramer, Edwin L. Herr, Charles N. Morris, and Thomas
T. Frantz. Research and the School Counselor. Boston: Houghton 
Mifflin Company, 1970. Pp. vii + 202. $3.50 (paperback). 


In the introduction to this text, C. Gilbert Wrenn commented 
that this book was “for first-year graduate students or counselors 
on the job who have little research knowledge” and that it was 
written in a clear and straightforward style. These are accurate 
descriptions for the work and this writer would add that nothing 
has been lost by presenting the material in such a readable fashion.
Throughout the text the authors attempt to convince school coun- 
selors that research is a necessary part of their responsibility and in 
so doing, the authors help to remove much of the aura surrounding 
the term “research”.

The first few chapters present some starting points that the coun-
selor might utilize in launching certain research activities in the
particular setting in which he finds himself. Adequate descriptions
were given of the types of research that the counselor might engage 
in, such as descriptive research, environmental assessment, follow-
up studies (what happens to students after they leave school), and
finally research through which the counselor comes to some con- 
clusions about the effectiveness of particular counseling programs 
and techniques. However, the section on the “asking” of questions 
opened a very crucial area but could have been expanded upon 
much more. The chapter on elementary descriptive statistics was
effective and particularly useful since the authors attempted to com-
bine it with some comments about the use of data found
in a school setting (e.g., cumulative records). Topics included
frequency distributions, measures of central tendency, standard de-
viation, normal curve, and standard scores. The discussion of standard
scores was somewhat compact and might be difficult for some-
one who is first being introduced to such a concept, but this is
quite a minor criticism of this particular chapter. Also practical
and clear was the discussion on the construction and use of norms 
noting that the counselor should be encouraged to construct those 
which are most appropriate for the sample with which he is working. 

Chapters four and five led the reader into a consideration of 
the area of expectancy tables and correlation techniques. The reader 
was presented with the rationale for expectancy tables and com- 
ments were made about their use in communicating to students, 
parents, and teachers the predictive quality of some of the informa- 
tion available to students. This writer was pleased to see that the
authors urged counselors to see that the interpretation made to 
students on the basis of an expectancy table “is not a substitute” 
for counseling but just another source of information to be used in 
the process and the authors supported this contention by providing 
some discussion of the cautions to be employed when using such 
tables. The chapter on correlation techniques might be a leap into 
the unknown for some readers who have no prior experience with 
this concept. Also this chapter on correlation might have been tied 
in better with the previous discussion on expectancy tables. But 
after this start the chapter slows down enough so that even the 
counselor inexperienced in statistical techniques should be able to 
develop a sound understanding. Not too practical was the presen- 
tation on the various types of correlation techniques determined by 
the variable or variables being researched. It appeared doubtful in 
this writer’s opinion that most counselors would venture this far 
into the data. But the writers did cover well the implications for 
the utilization of correlation techniques. Reliability and validity 
seemed to get very little attention in this section and if the reader 
has not thoroughly examined these applications of correlation 
measurement he will be in for a very difficult time in this section. 

The authors set aside one chapter to introduce the reader to the 
new and developing research area of environmental assessment. Be-
cause the field is just beginning to get off the ground it is often dif-
ficult to find much written on research rationale and methodology
in this area and therefore, this chapter should be a welcome sight
to school counselors. It is obvious that every day it is becoming 
more important to have some evidence as to the impact of the 
school milieu upon the individual student. Described were some 
instruments available for assessing the environment of such settings 
as high schools, colleges and universities and most important seems
to be the discussion on the means of taking on such a research task.
The counselor probably has not always been able to meet the ex- 
pectations of institutional researcher for many reasons but this chap- 
ter should give impetus to those so inclined. 

Chapters seven and eight dealt with traditional research areas 
such as the overall evaluation of guidance programs and followup
studies. Probably the most important research area in counseling
will continue to be the evaluation of counseling services. Sensitive 
questions were raised early in the chapter but appeared to be the 
kind of issues counselors will have to respond to in the course of 
their research activities. “Does counseling do what it purports to 
do?” essentially was the basic question. The criterion problem in
such research was clearly discussed and examples of criteria were 
given which might have practical significance in school setting. 
Also mentioned were realistic problems of sampling and controls. 
Overall this chapter was quite thorough and reflected the importance 
which the authors appropriately assigned to this counselor respon- 
sibility. In a following chapter the authors also dealt with the for-
malized followup of school dropouts and graduates. On the strength
of some very basic followup research in guidance (e.g. J. W. M. 
Rothney’s work) the authors gave the school counselor a good base 
upon which to make decisions concerning followup. This is a very 
practical chapter and covered very well the alternatives of question- 
naires and interviews in such activities. Important also was not 
only a discussion of the procedures for planning and implementing 
a program of followup research but the authors wisely included
some cautions to be observed while interpreting the data. 

Chapters nine and 10 dealt with class rank, academic average and 
the measurement of aptitude and opinion. Procedures were shown 
for using results of class ranking and academic averaging in order 
to communicate valid descriptions of designated groups in a school
setting. The discussion of measurement of opinions and attitudes 
was not unique and is the kind of discussion often available in most
basic measurement texts.

Certainly one of the most useful chapters was the one on studies 
of school dropouts and the importance of such data to school systems 
when properly used. Many practical suggestions were given and 
discussed thoroughly were the four general types of research on 
school dropouts: the reasons approach where brief responses are
gathered, the case study approach, the factor approach which in-
cluded the examination of such things as aptitude, socioeconomic
status, etc., and finally a broad social systems approach.

It was mentioned earlier by this writer that environmental assess- 
ment is a new and recent development in school systems but also 
relatively new on the scene is the use of data processing at many
levels of education. The chapter on the counselor and data proces-
sing focused on the need that the counselor will have to utilize such
procedures more and more. If some schools are not presently in-
volved in extensive data processing, the time is almost upon us
when this will be a common aspect of record keeping at all levels
of educational operation. The authors covered well both pencil-and-
paper methods as well as automatic data processing. This discussion
was quite detailed and systematic but this reviewer found it to be
very interesting and highly readable. Again the authors reflected 
upon all the important implications such as uses in the storage and 
retrieval of occupational and educational information utilized in all 
stages of individual development. 

What better way to conclude a work on practical research for 
the counselor than to include a look at the research activities of 
such men as Donald Super and John Rothney. Included were Super's
Career Pattern Study, Rothney's Guidance of American Youth and
Guidance Practices and Results, and Krumboltz's Revolution in
Counseling. However, Krumboltz is at an early stage in his re-
search career and the counselor will probably have to wait some
years to realize the practical significance of his work in comparison 
to that of Rothney and Super. 

When this writer first opened this text he felt that this might be 
just another organization of bits of basic statistics texts and measure-
ment texts but this was not true for this particular book. Both the
school counselor and graduate students beginning their programs in 
counselor education should welcome this work both as a tool for 
learning and also as an excellent source of information to be kept 
close at hand in one’s work setting. 


R. B. SIMONO
University of North Carolina at Charlotte 


William L. Hays and Robert L. Winkler. Statistics: Probability, 
Inference, and Decision. New York: Holt, Rinehart and
Winston, 1970. Volume I, pp. xviii + 650, $10.95. Volume II,
pp. xiv + 320, $8.95. 


It has become more and more difficult to come up with an
introductory textbook in applied statistics that has a truly new
look, but Hays and Winkler have done just that with their two-
volume Statistics: Probability, Inference, and Decision. The nov-
elty of their approach lies in a unique combination of four ele-
ments: (a) a comprehensive treatment of the fundamental ideas
of probability and probability distributions; (b) the direct use of
these ideas to develop insight into the theory of classical statistical
inference; (c) the thorough integration of Bayesian thinking and
decision theoretic concepts into the inferential process; (d) a con-
cluding section (the second volume) on statistical methods. The
resulting mixture is an exciting one that gives an excellent ele-
mentary introduction to decision theory, but the work as a whole
gives one the feeling of a lack of satisfactory closure. Basically,
the two volumes do not seem to form a set. The material in
Volume I on probability and classical inference is clearly needed
for both the chapters in Volume I on Bayesian inference and
decision theory and for the statistical methods described in Volume
II. However, Volume I builds up to an understanding of decision
theory as a sort of ultimate inferential process, but then does not
effectively carry this outlook over to the statistical methods in
Volume II. It seems likely that the student who uses both volumes
for a year's course in applied statistics for behavioral scientists
will come out with a schizoid feeling about decision theory and
statistical methods rather than with an appreciation of the need for
the development of methodology to routinely generalize decision
theoretic thinking to the use of statistical methods.

Each of the two volumes has distinct strengths and weaknesses 
that deserve comment. The biggest fault of Volume I is that it 
attempts to deal with continuous distributions as well as discrete 
ones while assuming both a background in calculus and no back- 
ground in calculus (through the use of “heuristic” explanations). 
The result is sometimes like using the Encyclopaedia Britannica
as a first grade reader: the student may understand what is being 
talked about if the teacher supplies the vocabulary word by word, 
but he cannot do anything with the material by himself. This 
criticism holds principally for the mathematical sections and sub- 
sections. Hays’ talent for readable, detailed exposition is every- 
where evident and the major concepts are generally very clear 
and understandable. Appropriately, then, the greatest strength of
Volume I is the clarity with which the authors show how decision 
theory provides tools for combining prior information about some 
phenomenon of interest with sample information to reach a decision 
that is best in some particular way, such as maximizing gain, or
minimizing loss, or some more complex criterion. 

Volume II appears to be potentially less useful than Volume I
in that it is not really compatible with Volume I and yet does not
stand alone. The primary difficulty is that the selection of statistical 
methods covered is too narrow although often quite deep. The very 
complete development given regression and correlation and the 
much-better-than-average section on sampling are laudable, but the 
discussion of the principles of experimental design is both too 
brief and too scattered among the computational details of various 
experimental designs. Moreover, notions of multiple comparisons 
are almost totally lacking. The treatment given nonparametric 
methods is admirable, however. This latter section is well- 

and quite modern in outlook. 

In summary, although Hays and Winkler have produced a new 
approach to elementary applied statistics, it is questionable whether 
the effort as a whole has much to commend it. The volumes are
comprehensive, even massive, and well written. They cover in
excellent fashion some materials on decision theory not heretofore 
available in elementary form. But, on the debit side of the ledger, 
breadth and depth of coverage of statistical methods are not well-
balanced and the authors consistently beg the question of teaching
a mathematically-based discipline without requiring a sound train-
ing in mathematics. As other authors who have faced this question
have done, Hays and Winkler talk about the use of mathematical
procedures extensively without ever requiring the student to take
the plunge and do some mathematics. This criticism is not Utopian
in tenor: a great many interesting problems in statistics per se can
be tackled with only the elementary calculus as background. Never-
theless, Hays and Winkler have done an excellent job in Volume I
of presenting the basic notions of decision theory and Bayesian
reasoning. In Volume II, the sections on correlation and regression
and on nonparametric methods are exemplary. For these reasons,
Statistics: Probability, Inference, and Decision should enjoy some 
deserved popularity during the period in which psychology makes 
up its mind that mathematics is the language of science and joins 
other disciplines in demanding of its students a thorough grounding 
in its essential branches. 


JAMES A. WALSH
Iowa State University


Emil F. Heermann and Larry A. Braskamp (Eds.). Readings in
Statistics for the Behavioral Sciences. Englewood Cliffs, N. J.:
Prentice-Hall, 1970. Pp. ix + 419. $4.95.


This collection of thirty-one papers, eleven of which were
published originally in the Psychological Bulletin and the majority
of the remainder in recent issues of psychological and educational
journals, is designed as a supplementary text for undergraduate
and graduate courses in statistics or as a textbook for statistics
seminars. Since students in the required basic and intermediate
statistics courses have little enough time to study a core text and
work the assigned exercises, this book will probably find a more
receptive audience in graduate seminars. Although the readings
and associated examples are aimed primarily at professionals in
education and psychology (so perhaps the term “behavioral” in
the title is a bit pretentious), most of the issues and problems
discussed are of a general methodological nature.

Predictably for a first edition, the book contains some typo-
graphical errors and oversights, but usually these do not detract.
One exception is on the back flyleaf, where someone neglected to
include a description of the contents of parts I and II. Also there is
a surplus of papers on certain issues, for example the question of
the relative merits of parametric and nonparametric techniques.
And a glance at a list of statistical reviews and notes appearing in
the Psychological Bulletin over the past 25 years reveals that many
important papers were omitted. In addition, certain older papers
and a more thorough discussion of historical controversies would
have been of interest to the reviewer. However, the editors had to
make choices, and by and large they made excellent ones.

The book is divided into six sections: I. History of the Appli- 
cation of Statistical Methods (2 papers), II. Parametric vs. Non- 
parametric Statistics (6 papers), III. Randomization (4 papers), 
IV. Testing Statistical Hypotheses (6 papers), V. Special Topics in 
Analysis of Variance (9 papers), and VI. Correlation and Regres- 
sion (4 papers). An introductory overview of the papers is given 
at the beginning of each section. 

The two papers in the first section, the first by Helen Walker and
the second by Jerzy Neyman, deal with the contributions of Karl
Pearson and R. A. Fisher and highlight the long-term argument
between these two giants as well as the contrast between the
Neyman-Pearson and Fisherian concepts of hypothesis testing and
interval estimation. The majority of the papers in the second sec-
tion, which is concerned with the issue of parametric vs. nonpara-
metric statistics, are relatively short. A provocative defense of the
controversial matter of levels of measurement (Stevens, 1951) is
presented in S. S. Stevens’ paper on “Measurement, Statistics, and
the Schemapiric View.” Finally, Donaldson’s discussion of the robust-
ness of the F test is perhaps the most technical paper in this sec-
tion.
The main issue considered in the third section (“Randomization”)
is the importance of the randomization assumption underlying tests
of hypotheses and the extent to which the sampling distributions
associated with these tests approximate those of the exact ran-
domization tests. The empirical results presented in the Baker and
Collier papers on the effects of skewness, kurtosis, and number of 
observations on the F ratios in the completely randomized design 
and the effects of block-treatment interaction, kurtosis, and block- 
variance heterogeneity on the results obtained from the random- 
ized blocks design are important papers in this section. 

Many of the papers in section IV on testing statistical hypothe- 
ses, in particular those by Binder and Rozeboom on the logic of 
null hypothesis testing, demand careful study. Several of these 
papers are quite technical and probably too difficult for the typical
graduate student in psychology or education. However, they deal
with some extremely important matters related to the epistemology
of statistical inference.

Section V, “Special Topics in Analysis of Variance,” contains 
the largest number of papers of any section in the book. A reading of
these papers should benefit almost all graduate students in behav-
ioral science statistics courses. For example, the Millman and Glass
paper on rules of thumb for writing the analysis of variance table
will be a blessing to anyone who has found that conventional
statistics cookbooks do not contain the layouts for all possible
experimental designs. In addition, the Evans and Anastasio paper
on misuses of covariance analysis should be read by all who employ
this method of controlling concomitant variables. Other important 
topics considered in this section are interaction, repeated measures 
designs, the homogeneity of variance assumption, and trend analy- 
sis with unequal n’s. 

The final section of the book, “Correlation and Regression,” 
should be of greatest interest to students and specialists in educa- 
tional and psychological measurement. Following two papers on 
normality and other assumptions underlying the interpretation of 
the product-moment coefficient is a detailed, thirty-one page treat- 
ment by Darlington of problems and issues in multiple regression. 
A short paper on Bartlett’s test of the significance of a correlation 
matrix, which will be a godsend to anyone who has vainly searched 
through Morrison (1967) or other books on multivariate statistical 
analysis for this type of test, completes the collection. 

In sum, Heermann and Braskamp have put together in a single 
volume a series of important, well-written papers, many of which 
are inadequately handled in required courses in psychological and 
educational statistics. Of course, the reviewer could quibble with 
the editors at length about some of their choices, but after reading 
all of these papers I felt that my time had been very well
spent. 
Benjamin Kleinmuntz (Ed.). Clinical Information Processing by
Computer: An Essay and Selected Readings. New York: Holt, 
Rinehart and Winston, 1969. Pp. xi + 399. $5.95 (paperback). 


Benjamin Kleinmuntz is among the most knowledgeable and 
thoughtful workers in the field of personality measurement and
whatever he writes in this area deserves careful attention. In the
present work he has composed an essay which runs to 106 pages.
For purposes of organizing his book, Kleinmuntz decided to break
the essay into five sections and has inserted a number of readings 
into the gaps. Section I is titled, “Introduction: Computers as Com-
putational and Noncomputational Information Machines.” This is
followed by readings by Hovland, Green, and Newell and Simon.
Section II is titled, “Personality Assessment: Computational Ap- 
plications” and is followed by articles by Williams and Kleinmuntz 
and Eiduson et al. Section III is titled, “Personality Assessment: 
Noncomputational Applications of Computers” and is followed 
by readings by Tomkins, Iker and Harway, and Dunphy et al. 
Section IV is titled, “Medical Diagnosis: Computational and Non-
computational Applications” and is followed by readings by Lip-
kin et al., Lusted and Stahl, Entwisle and Entwisle, and Feurzeig
et al. Finally, Section V sets forth the author's thoughts about the 
future for computer processing of clinical data. 

Kleinmuntz's essay is clear and direct, albeit terse in a few spots.
There is a notable lack of jargon without a sacrifice in ideas which 
should put the book within the reach of clinicians, the group for 
whom the book is mainly intended. To write with lucidity in a 
technical area is an achievement of no small proportion and Klein- 
muntz has carried it off exceedingly well. The introductory sec- 
tion of the essay presents basic information about what computers 
are and how they operate in a general way, but one which is 
sufficient for the author’s purpose. In the second section, Klein- 
muntz sets forth his point of view: 


Our point of view in this essay is a mechanistic one. Accord-
ingly, we depict the psychologist as an information-processing
organism who has collected direct observations, interviews, 
and tests as ‘inputs’ that he must process (analyze, organize and 
integrate) prior to ‘outputting’ his recommendations or pre- 
dictions. (p. 85) 


The author builds a strong and well documented case for his 
view that the computer is often as good as, if not better than, humans
in statistical processing of clinical information in psychodiag- 
nostic work. The argument is vintage Meehl. The author is on 
weaker ground when, in Section III, he develops the case for the 
noncomputational uses of computers (an interesting anomaly but 
one that’s been around for quite some time). Here, Kleinmuntz 
sees computers as playing an important role in the collection of
information, e.g., mental status interviewing, according to some 
predetermined set of rules. He notes that the interview process 
contains sources of error arising from the interviewer, the inter-
viewee, and the interview process. Kleinmuntz's hope, “. . . is that
some of these sources of error will be minimized by automating the 
interview.” (p. 150) Perhaps so, but what new sources of error 
will be introduced in its stead is impossible to estimate. Much work 
over a long period of time will be required to even begin to de- 
termine the possible roles that computers can play in the collection 
of psychological data. 
In section IV, uses of computers in medical diagnosis, Klein-
muntz returns to solid ground. The basically Bayesian approach
to computer use in this area is well explicated and several neat
examples are presented. On the other hand, the discussion of the
use of the computer as a teaching device in medical diagnosis
seems more optimistic than the present state of development war-
rants. Kleinmuntz concludes his essay with a well balanced mix-
ture of optimism and caution.

The injection of caution is interesting in its own right. When 
this writer was introduced to computers in 1961, the field was 
filled with an almost unbridled optimism. Newell and Simon 
pointed the way and many of us were quick to follow. Borko's 
Computer Applications in the Behavioral Sciences (1962) mapped 
out a brave new world. The future, however, did not run out as 
rosy as expected. After the initial flush of successes documented 
in Borko's book, there was a marked dry period in several areas 
and almost a total closing out of others, e.g., computer music. This 
is reflected in the publication dates of the articles Kleinmuntz
selected for inclusion in his book. Of the thirteen readings, four
were originally published before 1962, three in 1963, one in 1964,
four in 1965 and one in 1966. The absence of more recently pub-
lished works reflects, this writer believes, a notable lack of progress
in the field rather than any lack of diligence on Kleinmuntz's part.

Where computer processing of clinical information has had its 
greatest successes and continues to be successful is in the area of 
statistical processing of data. Recent history has borne out the cor- 
rectness of Meehl's position. Future progress, however, would 
seem to depend not on the development of new statistical procedures 
and computer software, but rather on the development of improved 
measuring instruments and procedures. 

In summary, Kleinmuntz has produced a book intended mainly
for clinicians that is stimulating and provocative albeit slightly

е. The deficiencies noted above are a reflection on 
the state of the field and not on the author. 
particular area of experimental procedure. The identified categories
are: general problems of psychological experiments,
inherent difficulties of groups in experimental research, descrip-
tion of problems to be investigated, procedures and available in-
struments, social psychology of the experimental situation, and an
introduction to simulation. The first four parts are preceded by 
an introduction which details the editors’ position with respect 
to the authors’ argument. 

Eighteen of the twenty-one papers are translated from English. 
These are generally well known and available in this country (e.g.,
Donald T. Campbell’s Factors relevant to validity of experiments
in social settings, Psychological Bulletin, 1957, 54, pp. 297-312).
Consequently, the book adds little to experimental logy 
in the United States. 

The collection and the original articles would be of considerable 
value to individuals in this country interested in developing their 
fluency in English-French translation. 

ROBERT SMITH
University of Southern California 

La recherche en enseignement programmé—tendances actuelles 
(Programmed learning research—major trends), actes d'un col- 
loque O.T.A.N. Nice 1968, Sciences du t 8, col- 
lection dirigée par F. Bresson et M. de Montmollin, Paris: Du-
nod, 1969, pp. 360, 96F (paperback).

This collection of papers was presented at a symposium in
Nice, France, from May 13 to May 17, 1968. The symposium was
sponsored by the Scientific Committee (Consultant Group on Hu-
man Factors) of the North Atlantic Treaty Organization (NATO).
The symposium is one of a series that allows specialists from 
several NATO countries to examine a subject of common interest. 

The papers are organized under four general headings: Analysis 
and Structure of the Subject Matter, R. Gagné, co-chairman
(United States); The Learning Process and Problem Solving,
Atkinson, chairman (United States); Categories of Learning and
Criteria for Evaluation of Learning Outcomes, R. Glaser, chairman
(United States); Adaptive Machines, G. Pask, chairman (Great
Britain). The proceedings reflect the same general preponderance
of research from the United States. Programmed learning as viewed
in this country vis-a-vis programmed teaching in the European
countries dominated the discussions.

The general topics have received extensive consideration in this 
country and there seems to have been little of value added by 
moving the debates to a more salubrious clime. 


ROBERT SMITH
University of Southern California 
RELIABLE AND VALID HIERARCHICAL
CLASSIFICATION


LOUIS L. McQUITTY AND JEWEL M. FRARY


University of Miami
Coral Gables, Florida


Some methods of hierarchical classification use only the highest
index of association of every object with every other object; other
methods use all indices of association (McQuitty, 1967, 1968; Mc-
Quitty and Clark, 1968). A problem is to use that particular set of
indices of association which produces the most reliable and valid
solution. This is what Reliable and Valid Hierarchical Classification
attempts to accomplish.


Characterization of Types 


Definitions 


The method is derived from definitions of types.

Pure types. A pure type can be defined as a category of two or
more objects of such a nature that every object in the category is
more like every other object in the category, in terms of specified
characteristics, than it is like any other object in any other cate-
gory.

A pure type can also be defined as a category of two or more ob-
jects with a unique pattern of characteristics. Every object in the
category possesses all of the characteristics of the pattern, and no
object not in the category possesses all of them. The latter kind of
object can possess some but not all of the characteristics.

This investigation ... Service Research Grant ...
... revision and elaboration of a paper read at the annual ...

If this latter definition of a pure type is translated into the same
terminology as the first definition, it states that a type is a category 
of two or more objects of such a nature that every object of the cate- 
gory is identical, in terms of specified characteristics, with every 
other object of the category, and no object not in the category is 
identical with the objects of the category. 

As another elaboration, a pure type can be defined as a category
of two or more objects of such a nature that every object in the 
category is most like some other object in the category; every ob- 
ject is classified with the object most like itself. 

These three definitions of a type are three specifications of inter- 
relationships which characterize configurations constituting pure 
types. At the same time they suggest other configurations within 
the same general area as these but not necessarily satisfying com- 
pletely the definition of any one of them. From this latter and more 
general point of view, a pure type is a category of two or more ob- 
jects which belong together because of some inherent pattern of as- 
sociations among the objects. There is no limit on the kind of con- 
figurations expressed by the patterns; they must, however, be held 
together by intrinsic associations. 

Real types. Pure types exist in theory; their correlates in nature
are called real types. They resemble pure types but usually do not 
conform completely to them. 


A Numerical Display 


Both pure and real types can be numerically displayed in
matrices. One of the simpler ways of doing this is illustrated in the
hypothetical data of Tables 1, 2, and 3.

The entry of 2 in Row C—Column A of Table 1, for example, 
reports that Object C is second most like Object A; Objects B and 
D are first and third most like Object A. The other entries are in- 
terpreted in an analogous fashion. 

TABLE 1
Within-Column Rank Orders for a "Square" Type: Hypothetical Data

      A   B   C   D   W   X   Y   Z

A         1   2   2
B     1       3   3
C     2   2       1
D     3   3   1
W                         2   2   1
X                     3       1   2
Y                     2   1       3
Z                     1   3   3

Note: Entries in the upper right and lower left quadrants are from
4 to 7 inclusive.


Table 1 illustrates fulfillment of the first pure type defined in
this paper. In terms of the numerical display the two categories
are called square types. The group of Objects A, B, C, and D, and
the group of Objects W, X, Y, and Z, each constitute a square
type. They are pure in the sense that no object has a rank larger
than n-1 with any other object; n equals the number of objects in
the submatrix. This condition means that every object in each
submatrix is more like every other object in that submatrix than
it is like any object in any other submatrix.

In Table 2, Objects E, F, G, and H are identical and are different
from Objects S, T, U, and V, which are also identical with one
another. These two categories illustrate the second definition of a
pure type, called an identity type.

Table 3 includes first another square type, Objects I and J, and
secondly it illustrates the third definition of a type, represented in
this case by an elongated type. Objects K and L are reciprocal; then
Object L brings in Object M because Object M has Object L most
like it. In a similar fashion, Object M brings in Object N, and
Object N brings in Object O.

The essential associations of this latter type are shown more
clearly in Figure 1, which portrays the elongated feature of the
type.

Another configuration is labeled a spotted type; it resembles
some one of the above configurations except that it varies from
the standard configuration in certain cell entries and in specifiable
amounts within those cell entries.


TABLE 2
Within-Column Rank Orders for an "Identity" Type: Hypothetical Data

      E   F   G   H   S   T   U   V

E         1   1   1
F     1       1   1
G     1   1       1
H     1   1   1
S                         1   1   1
T                     1       1   1
U                     1   1       1
V                     1   1   1

Note: Entries in the upper right and lower left quadrants are from
4 to 7 inclusive.

TABLE 3
Within-Column Rank Orders for an Elongated Type: Hypothetical Data

      I   J   K   L   M   N   O

I         1
J     1
K                 1
L             1       1
M                         1
N                             1
O

Note: All other entries are greater than one.


An example of a spotted type is taken from a hierarchical
classification by Iterative, Intercolumnar Correlational Analysis
(McQuitty and Clark, 1968) of real data and is reported in Table
4. Three types are reported in the table, one for each of the three
Reciprocal Pairs 3-6 (or 3-16), 10-20, and 4-12. Objects 4 and 12,
for example, are reciprocal because 4 is highest with 12 and 12 is
highest with 4.

If each of these three types of 11, 6, and 3 objects respectively
were a perfect representation of a pure, square type, then the first
one would have no rank above 10, and the other two would have
none above 5 and 2 respectively, i.e., none above n-1, where n =
the number of objects in the type.

The footnoted values in the table indicate the ranks which are
larger than the prescribed standard. This is the sense in which the
types are spotted, and the size of the discrepant ranks shows how
much they exceed the upper limit of n-1. The value, 14, in Row 13,
Column 5, for example, has a deviation of 14 - 10 = 4; n-1 = 10.

When data have been ranked within columns, a rank reflects a
spot in a submatrix when its value is larger than n-1, where n is
the number of objects in the submatrix and the ranks are those lifted
from the matrix being divided.
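
The arithmetic of a spot can be written out directly. This is a small
Python sketch under the definition just given; the function name
spot_deviation is an assumption introduced here, and the two nonzero
examples use figures quoted in the text (a rank of 14 in a type of 11
objects, and a rank of 10 in a type of 3 objects).

```python
def spot_deviation(rank, n):
    """Amount by which a within-column rank exceeds the limit n - 1.

    Returns 0 when the rank is within the limit, i.e. when it is no spot.
    """
    return max(0, rank - (n - 1))


print(spot_deviation(14, 11))  # 4
print(spot_deviation(10, 3))   # 8
print(spot_deviation(2, 3))    # 0
```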

Nonmembers 


An advantage of the concept of spots is that it can be applied to
indicate nonmembers. Object 8, classified with Objects 4 and 12 in
the lower right-hand corner of Table 4, can be characterized as a
nonmember because of the size of its deviation; out of the 20
objects of the original matrix it is only tenth most like each of the
Objects 4 and 12 with which it is classified. In order to fit perfectly
into the square type, it would have to be second most like Objects
4 and 12.


Figure 1. The essential structure of type KLMNO.


TABLE 4
Results of an Hierarchical Classification by Iterative, Intercolumnar
Correlational Analysis


A problem is to determine when spots are sufficiently large to
reflect nonmembers. All nonmembers yield spots, but not all spots
reflect nonmembers. One practical approach is to require that every
object be classified in every set of data; a spot then reflects a
nonmember if, and only if, one or more objects are not correctly
classified within the set of data. In an incorrect classification, there
are some larger spots than there would be if all objects were
correctly classified; they, therefore, reflect nonmembers. The correct
classification of data which conforms perfectly to theory would
eliminate all spots. The correct classification of data which does not
conform would not eliminate all spots; it would, nevertheless,
reduce at least some spots. Those spots which are eliminated or even
merely reduced, in going from an incorrect to a correct
classification, reflect nonmembers in the incorrect classification. In
this approach, a correct classification must be determined by a
criterion other than whether or not nonmembers are present;
otherwise the definition of nonmembers and correct classification
would both be circular. One such criterion is the logic of the method
of classification.

Another way to characterize nonmembers is in terms of the data 
by which objects are classified; a spot reflects one or two nonmem- 
bers in a set of data if the one or two objects can be classified with 
a smaller deviation in another set of data.

In the latter approach, spots can be hypothesized to reflect
nonmembers from two interrelated points of view: (1) the objects
seem not to conform to the other objects of a set in terms of con- 
tent, and (2) the nonfit is supported by the fact that the objects yield 
spots which are relatively large. Hypothesized nonmembers are 
then examined in a set of data with objects thought to be more 
appropriate to their classification and if they yield lower deviations 
in the new set of objects they are then confirmed as nonmembers 
in the original set of objects. 


Nonmembers are generated also in a purely objective fashion.
If an analysis of n objects yields one or more types and one or
more unclassified objects, the one or more unclassified objects are
thereby nonmembers.

Every object is a nonmember at its terminal level of classification
(from top down) where it stands alone and is contrasted
only with other nonmembers. In order to distinguish these latter
nonmembers from those characterized above, they are called
entities rather than nonmembers.

When a matrix divides to yield one nonmember and a submatrix
of two or more other objects, the latter submatrix constitutes a
pseudotype. In order to fulfill the requirement of a type, a
submatrix must be separated from at least two other objects.

If every division throughout a hierarchical analysis yields the
above kind of result, then the entire structure is pseudotypological.

Mental Health Theory 


The concept of spotted types has another advantage. It is
hypothesized that “mental patients” reflect more nonmembers, more
spots, and more extreme spots than “normals” and that they do this
more extensively in matrices of interrelationships between
characteristics within the single individual than they do in matrices
reporting relationships between persons. These are hypotheses for
later studies; their investigation is made possible by the approaches
of this paper.


The Method 


The concept of pure types can be used to generate a simple method
for the isolation of real types.


Hypothetical Data


The method is generated with specific reference to the first
definition of pure types and with hypothetical data which fulfill the
definition.

Tables 5 and 6 report rank orders within columns for the members
of two sets of pure types; Objects B, C, and D of Table 5, for
example, are first, second, and third most like Object A. Other
column entries are interpreted in an analogous fashion.

Table 7 combines in a single matrix the within-column ranks of
the first two tables. In this larger matrix, the objects of the first
two tables are intermingled.


TABLE 5
Within-Column Rank Orders for the First Set of Hypothetical Data

      A   B   C   D

A         1   2   2
B     1       3   3
C     2   2       1
D     3   3   1


By definition of a pure type, all of the empty cells of Table 7
(other than the diagonals) have entries larger than n-1 = 3. This
is because the two square types are required by definition to have
all of the entries of n-1 and smaller.

The two types reveal themselves clearly in the patterns of ranks
of Table 7 as encompassed by the lines enclosing each of the two
types.

There is, however, another simple way for isolating the types,
even if the types are only real (not pure) and do not reveal
themselves clearly in a form such as illustrated in the table.

In the two columns on the extreme right of Table 7 are reported 
first the number of ranks of one in each row and then the number 
of ranks of one and two combined in each row. The numbers in 
the first of these two columns are the criteria for classifications in 
terms of a rank of one, and those in the second column are criteria 
in terms of ranks of one and two combined. 
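
The two criterion columns can be computed by counting qualifying ranks
in each row. The sketch below is illustrative only: the function name
criteria and the dictionary layout are assumptions, and the data are
the within-type ranks of Tables 5 and 6. Cross-type entries are
omitted because they lie between 4 and 7, so the counts are valid only
for thresholds of 3 or less.

```python
# ROWS[row][col] = rank of the row object within the column of col.
# Only ranks of 3 or less are listed; all other entries are 4 to 7.
ROWS = {
    "A": {"B": 1, "C": 2, "D": 2},
    "B": {"A": 1, "C": 3, "D": 3},
    "C": {"A": 2, "B": 2, "D": 1},
    "D": {"A": 3, "B": 3, "C": 1},
    "W": {"X": 2, "Y": 2, "Z": 1},
    "X": {"W": 3, "Y": 1, "Z": 2},
    "Y": {"W": 2, "X": 1, "Z": 3},
    "Z": {"W": 1, "X": 3, "Y": 3},
}


def criteria(rows, max_rank):
    """Count, for every row, the entries with a rank of max_rank or better."""
    return {
        row: sum(1 for rank in cols.values() if rank <= max_rank)
        for row, cols in rows.items()
    }


print(criteria(ROWS, 1))  # every row has exactly one rank of one
print(criteria(ROWS, 2))  # rows A, C, and W reach the top criterion of 3
```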

The analysis starts with the highest rank (one) and the largest
criterion for one and proceeds first by lowering the criterion for
ranks of one and next by lowering the ranks to include one and
two; one, two, and three; one, two, three, and four, etc. Within
each decrease in rank the largest criterion is first used and then
successively lower criteria until exhaustion before proceeding to a
further decrease in rank.


TABLE 6
Within-Column Rank Orders for the Second Set of Hypothetical Data

      W   X   Y   Z

W         2   2   1
X     3       1   2
Y     2   1       3
Z     1   3   3


TABLE 7
Within-Column Ranks of Tables 5 and 6 Combined in a Single Matrix

                                       Number of ranks of
      A   W   X   C   D   B   Y   Z       1      1 and 2

A                 2   2   1               1         3
W             2               2   1       1         3
X         3                   1   2       1         2
C     2               1   2               1         3
D     3           1       3               1         1
B     1           3   3                   1         1
Y         2   1                   3       1         2
Z         1   3               3           1         1


Every time the ranks are lowered, such as from one to one and
two combined, the classification starts over again.

The largest criterion for a rank of one in Table 7 is one, and it is
reported in all eight rows. The starting point for tied criteria is
unimportant, but all of them must, of course, be used.

Row A of Table 7 assigns Objects A and B to a common type
because B has a rank of one in this row. This fact is reported
in the first row of Table 8 by an asterisk in Row A and Column B.
Analogously, Row B of Table 7 assigns Objects A and B to the same
type.

Neither Object A nor B, nor any other object, assigns any other
member to the above type. These two members constitute a type;
A is most like B, and B is most like A. In an analogous fashion, the
remaining rows assign the other objects to types of two objects
each, as summarized in Table 8.


TABLE 8
Assigning the Objects of Table 7 to Types of Two Objects Each

   A   W   X   C   D   B   Y   Z   Criterion

   A   .   .   .   .   *   .   .       1
   .   W   .   .   .   .   .   *       1
   .   .   X   .   .   .   *   .       1
   .   .   .   C   *   .   .   .       1
   .   .   .   *   D   .   .   .       1
   *   .   .   .   .   B   .   .       1
   .   .   *   .   .   .   Y   .       1
   .   *   .   .   .   .   .   Z       1


The analysis continues with ranks of one and two combined. The
last column of Table 7 reports the number of ranks of one and
two combined in each
row of Table 7. These entries are the criterion values for the
analysis. In analyzing the last column of Table 7, the initial step
applies to the row or rows with the largest criterion, three in this
case. There are three rows with a criterion of three. The start is with
the first criterion of three, from top down, chosen arbitrarily. Under
the criterion of three, Row A of Table 7 assigns Objects A, B, C, and
D to a type, as summarized in Table 9 by asterisks in Row A and
Columns C, D, and B; Row W assigns Objects W, X, Y, and Z to
a type; and Row C assigns A, B, C, and D to a type, confirming
the action of Row A.

The analysis is not carried beyond the above point because the 
matrix of Table 7 has now been bifurcated into two submatrices 
in terms of the most dependable indices, those reflecting ranks of 
one and two and excluding the larger ranks of three and above. In 
this case, continuing with the larger and less dependable ranks 
would reconfirm the above results. Such is not, however, necessarily
the case when isolating real types rather than pure types. 
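
The assignments recorded in Tables 8 and 9 follow from one rule: a row
assigns its own object together with every object holding a qualifying
rank in that row. A minimal Python sketch of that rule follows; the
function name assigned_by is an assumption, and the data are again the
within-type ranks of the hypothetical example (other entries exceed 3
and are omitted).

```python
# ROWS[row][col] = rank of the row object within the column of col.
ROWS = {
    "A": {"B": 1, "C": 2, "D": 2},
    "B": {"A": 1, "C": 3, "D": 3},
    "C": {"A": 2, "B": 2, "D": 1},
    "D": {"A": 3, "B": 3, "C": 1},
    "W": {"X": 2, "Y": 2, "Z": 1},
    "X": {"W": 3, "Y": 1, "Z": 2},
    "Y": {"W": 2, "X": 1, "Z": 3},
    "Z": {"W": 1, "X": 3, "Y": 3},
}


def assigned_by(rows, row, max_rank):
    """Objects that `row` assigns to a type: itself plus every object
    whose rank in this row is max_rank or better."""
    return {row} | {col for col, rank in rows[row].items() if rank <= max_rank}


# Ranks of one only (Table 8): Row A assigns A and B.
print(sorted(assigned_by(ROWS, "A", 1)))  # ['A', 'B']

# Ranks of one and two (Table 9): Rows A and C agree, Row W gives the rest.
print(sorted(assigned_by(ROWS, "A", 2)))  # ['A', 'B', 'C', 'D']
print(sorted(assigned_by(ROWS, "C", 2)))  # ['A', 'B', 'C', 'D']
print(sorted(assigned_by(ROWS, "W", 2)))  # ['W', 'X', 'Y', 'Z']
```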

The above analysis was initiated by using first the most
dependable ranks, viz., those of one which are based on the highest
index of association within every column. The analysis proceeded
toward bringing in the lower ranks, first the ranks of two, then
three, four, etc. until the matrix was bifurcated. Ranks of one and
two bifurcated the matrix of the present example.


TABLE 9
Assigning the Objects of Table 7 to Types of Four Objects Each

   A   W   X   C   D   B   Y   Z   Criterion

   A   .   .   *   *   *   .   .       3
   .   W   *   .   .   .   *   *       3
   *   .   .   C   *   *   .   .       3



Real Data 


Table 10 reports agreement scores between spoons based on the
presence and absence of characteristics as judged by a single
subject.

The data were chosen for illustrating the present method because
they contain a number of unique problems and they had proven
difficult to analyze in an earlier study (McQuitty, Price, and Clark,
1967).

Table 11 reports the ranks within columns for the data of Table
10. In the case of a tie in agreement scores, all tied scores are
assigned the rank which would be given if there were only one score
at that value and all of the other scores were smaller. For example,
the seven highest agreement scores in Column 16 of Table 10 are
34, 33, 30, 30, 30, 30, and 29. They are assigned ranks of 1, 2, 3, 3,
3, 3, and 7 respectively, as reported in Column 16 of Table 11.
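
The tie rule amounts to giving each score a rank of one plus the
number of strictly larger scores in its column. A short Python sketch
of that reading, with a function name (rank_with_ties) chosen here for
illustration, reproduces the example from Column 16:

```python
def rank_with_ties(scores):
    """Rank scores so that the largest gets rank 1 and tied scores all
    receive the best rank among them; the next lower score then skips
    as many ranks as there were tied values."""
    return [1 + sum(other > score for other in scores) for score in scores]


print(rank_with_ties([34, 33, 30, 30, 30, 30, 29]))  # [1, 2, 3, 3, 3, 3, 7]
```

The quadratic count is adequate for matrices of this size (20 objects
per column); a sort-based version would be preferable for large data.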

Column 2 of Table 12 reports the number of ranks of one in
every row of Table 11. Successive columns of Table 12 report the
number of ranks of 1 and 2; 1, 2, and 3; and 1, 2, 3, and 4,
respectively, in every row of Table 11.

Columns 2, 3, 4, and 5 of Table 12 are used to assign spoons to
types. The highest criterion in Column 2 of Table 12 is 5. It
pertains to Spoon 3. The further analysis begins, therefore, with
Spoon 3. As shown by the ranks of 1 in Row 3 of Table 11, Spoon
3 assigns itself and five other spoons to Type 1, viz., 3, 6, 9, 14, 16,
and 17. This fact is reported in Table 13.

The next highest criterion is 4 for Spoons 6 and 16, as shown in 
Table 12. Since there is a tie, both spoons will be analyzed; the 
one which is selected first is immaterial. Spoon 6 assigns Spoons 
3, 6, 9, 11, and 19 to a type, determined to be Type 1 because of 
the overlap in assignments of these spoons with spoons already as- 
signed to Type 1. Spoon 16 assigns Spoons 16, 1, 3, 9, and 13 to 
Type 1. 

The analysis continues by using assignment spoons with
successively smaller criteria until all spoons are assigned. Spoon 10,
under a criterion of 3, assigns Spoons 10, 7, 15, and 20 to a type,
determined tentatively to be Type 2 because there is no overlap
with Type 1. Spoon 20, under a criterion of 2, assigns Spoons 20,
10, and 18 to Type 2, so determined because of the overlap of these
spoons with those already assigned to Type 2. Spoon 4, under a
criterion of one, assigns Spoons 4 and 12 to a new type, tentatively
Type 3, because there is no overlap with previous types.

Analogously, under a criterion of one, Spoon 5 assigns itself and
Spoon 8 to Tentative Type 4; 9 assigns itself and 13 to Type 1;
11, itself and 5 to Type 1; 12, itself and 4 to Type 3; and 15, itself
and 2 to Type 2.


TABLE 12
Criterion Values for the Rows of Table 11

                          Number of ranks of
Code Number
of Spoons       1     1 and 2     1, 2, and 3     1, 2, 3, and 4

     1          0        1             2                5
     2          0        3             5                5
     3          5        7             9               10
     4          1        1             2                3
     5          1        1             1                1
     6          4        5             8                8
     7          0        4             6                6
     8          0        0             0                0
     9          1        4             5                9
    10          3        4             4                5
    11          1        2             3                3
    12          1        2             3                3
    13          0        0             1                4
    14          0        0             1                2
    15          1        3             4                7
    16          4        6             6                7
    17          0        0             1                4
    18          0        1             1                2
    19          0        2             3                3
    20          2        5             5                6

For convenience in reading the typal memberships, they are
summarized in the bottom row of Table 13. There is but one conflict
in the above assignments; Spoon 5 with a criterion of one assigned
itself and Spoon 8 to Type 4, but also with a criterion of one
Spoon 11 assigned itself and Spoon 5 to Type 1. This latter action
also assigned Tentative Type 4 to Type 1 and thus eliminated
Tentative Type 4. This occurred because once Spoon 5 was assigned
to Type 1, Tentative Type 4 produced an overlap with Type 1. The
other two types, 2 and 3, are defined without conflict.
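
The handling of overlaps, including the absorption of Tentative Type 4
into Type 1, can be sketched as a running merge of overlapping groups.
The Python below is an illustration rather than the authors' own
procedure as coded; the function name merge_assignments is an
assumption, and the input lists are the assignments stated in the text
for ranks of one, in the order in which they were made.

```python
def merge_assignments(assignments):
    """Fold each new group into every existing type it overlaps;
    a group with no overlap starts a new tentative type."""
    types = []
    for group in assignments:
        new = set(group)
        overlapping = [t for t in types if t & new]
        types = [t for t in types if not (t & new)]
        types.append(new.union(*overlapping))
    # Sorted lists make the result easy to read and compare.
    return sorted(sorted(t) for t in types)


# Assignments made by Spoons 3, 6, 16, 10, 20, 4, 5, 9, 11, 12, and 15.
SPOON_ASSIGNMENTS = [
    [3, 6, 9, 14, 16, 17],
    [6, 3, 9, 11, 19],
    [16, 1, 3, 9, 13],
    [10, 7, 15, 20],
    [20, 10, 18],
    [4, 12],
    [5, 8],
    [9, 13],
    [11, 5],
    [12, 4],
    [15, 2],
]

for members in merge_assignments(SPOON_ASSIGNMENTS):
    print(members)
# [1, 3, 5, 6, 8, 9, 11, 13, 14, 16, 17, 19]
# [2, 7, 10, 15, 18, 20]
# [4, 12]
```

Spoons 5 and 8 first form their own group and are then pulled into the
largest type when Spoon 11 assigns itself and Spoon 5, which leaves the
three categories described in the text.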
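TABLE 13
Assigning Spoons to Types Based on Ranks of One

Type   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20  Criterion

  1    .  .  3  .  .  *  .  .  *  .  .  .  .  *  .  *  *  .  .  .      5
  1    .  .  *  .  .  6  .  .  *  .  *  .  .  .  .  .  .  .  *  .      4
  1    *  .  *  .  .  .  .  .  *  .  .  .  *  .  . 16  .  .  .  .      4
  2    .  .  .  .  .  .  *  .  . 10  .  .  .  .  *  .  .  .  .  *      3
  2    .  .  .  .  .  .  .  .  .  *  .  .  .  .  .  .  .  *  . 20      2
  3    .  .  .  4  .  .  .  .  .  .  .  *  .  .  .  .  .  .  .  .      1
  4    .  .  .  .  5  .  .  *  .  .  .  .  .  .  .  .  .  .  .  .      1
  1    .  .  .  .  .  .  .  .  9  .  .  .  *  .  .  .  .  .  .  .      1
  1    .  .  .  .  *  .  .  .  .  . 11  .  .  .  .  .  .  .  .  .      1
  3    .  .  .  *  .  .  .  .  .  .  . 12  .  .  .  .  .  .  .  .      1
  2    .  *  .  .  .  .  .  .  .  .  .  .  .  . 15  .  .  .  .  .      1

Types  1  2  1  3  1  1  2  1  1  2  1  3  1  1  2  1  1  2  1  2

Note: Spoons 5 and 8 were assigned first to Tentative Type 4 and
then to Type 1.


A complete hierarchical analysis often classifies the objects
into two categories at the top level of classification. In the above
results there are three categories. The analysis is continued using
ranks of 1 and 2 combined in lieu of just a rank of 1.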

Table 14 derives from Table 11 and from Column 3 of Table 
12 for ranks of 1 and 2 combined in the same fashion that Table 
13 was derived from these tables for ranks of 1, exclusively. 

By the time the criterion of 3 for ranks of 1 and 2 combined 
(Column 3, Table 12) is exhausted, all spoons are assigned to 
Types 1 and 2 in Table 14, and there are no conflicts. The analysis 
of the original matrix into two categories is complete. 

In order to illustrate the eventual degeneration of assignments
in relation to declining indices of reliability and validity as lower
ranks are incorporated into the analysis, the method was applied
to ranks of 1, 2, and 3 combined and to ranks of 1, 2, 3, and 4
combined, with the results reported in Tables 15 and 16
respectively.

The classification into types by ranks of 1, 2, and 3 combined
confirmed the classification by ranks of 1 and 2 combined. Ranks of
1, 2, 3, and 4 failed, however, to yield a classification of all objects
into one of two types.

As summarized in Table 16, criteria of 10, 9, and 8 assigned all
objects except 2, 4, 7, 8, 10, 12, 18, and 20 to Type 1. A criterion
of 7 had to be applied. Using first Object 15, with a criterion
of 7, all objects, except Object 7, were assigned to Type 1.
Continuing with a criterion of 7 and using Object 16, Object 7
still remained unassigned with all other objects assigned to Type 1.
Proceeding with a criterion of 6, all objects were assigned to
Type 1. Consequently, the ranks of 1, 2, 3, and 4 combined failed
to assign every object to one of two types.

The classification by ranks of 1 and 2 combined (and confirmed 
by 1, 2, and 3 combined) is favored over that of 1, 2, 3, and 4 
combined because it is based on higher ranks and it classifies every 
object into one of two classes. 

The indication of the above results is that the relatively high 
indices of association between objects can be used to yield reliable 
and valid assignment to types and that the lesser indices sometimes 
fail and cannot always be used with the same dependability. 

The analysis is continued by applying the above procedures 
separately to each of the two types (or submatrices) which derived 
from the above analysis, using ranks of either 1 and 2 combined, 
or 1, 2, and 3 combined. 

The procedures are applied here to the analysis of the larger of 
the two submatrices to illustrate some results not obtained in the 
above analysis. 

Table 17 reports the ranks within columns of the agreement
scores of the spoons of Type 1, as isolated in Table 14. A column
on the right reports the number of ranks of one in every row of the
table. The information of this column is used to assign the spoons to
types as shown in Table 18. Spoon 5 drops out because all other
spoons are assigned prior to it. The other spoons constitute a
pseudotype. Spoon 5 is a nonmember at this level of classification.

A result like the above is a solution because it fulfills our
definition of a pseudotype. It does not, however, classify all objects
into one of two types, which is a goal to be realized when possible.

Whether or not the latter purpose can be realized is determined
by continuing the steps of the analysis as summarized in Tables
19, 20, and 21.

The additional analysis shows that the best solution by ranks
of 1 and 2 combined is to assign all but two objects to a type.
The best solution by ranks of 1, 2, and 3 combined and also by
ranks of 1, 2, 3, and 4 combined is to assign all but one object to a
type, viz., Object 13. Ranks of 1, 2, 3, 4, and 5 combined yield a
criterion of 10 (out of a total of 11 objects) and therefore assign all
objects to a single category (composed of the assignment object and
the other 10 objects). Increasing the ranks will continue to yield
the same result because the criterion never decreases with an
increase in ranks.


TABLE 17
Ranks Within Columns of Agreement Scores for the Spoons of Type 1


TABLE 18
Assigning Spoons of Table 17 to Types Based on Ranks of One

Type   1  3  5  6  9 11 13 14 16 17 19  Criterion

  1    .  3  .  *  *  .  .  *  *  *  .      5
  1    .  *  .  6  *  *  .  .  .  .  *      4
  1    *  *  .  .  *  .  *  . 16  .  .      4


The original solution based on the rank of one is accepted be- 
cause it classifies as many objects as any other solution and does 
it on the basis of higher ranks. 

Table 22 reports the ranks within columns of the agreement 
scores of the pseudotype. A rank of 1 with successive criteria first 
of 5 and then of 4 yields the results shown in Table 23. The criterion 
of 5 fails to assign the objects to two types. Criteria of 4 and 5 place 
all objects in the same category. 

Ranks of 1 and 2 with successive criteria first of 6 and then of
5 and 6 yield the results shown in Table 24. A criterion of 6 fails
to assign all of the objects and the criteria of 5 and 6 assign all
objects to the same category.

In situations like the above (represented by Tables 23 and 24
and also earlier in Table 17) there are at least two possible
solutions. One solution involves an assumption: if a rank of n assigns
all objects to a single category without realizing a solution, it is
assumed that every larger rank would also assign them all to a
single category without realizing a solution. Under this approach,
the investigator returns to an earlier stage of the analysis which
did not assign every object to but one category. He selects that
stage which classifies the most objects. It is in this case a rank of
1 and 2 and a criterion of 6 (Table 24). It assigned Objects 1, 3,
6, 9, 13, 14, 16, and 17 to a type and left Objects 11 and 19
unassigned and therefore nonmembers. (The approach could have
yielded more than one type and one or more nonmembers.)


TABLE 19
Assigning Spoons of Table 17 to Types Based on Ranks of One and Two
Combined

Type    Assigning spoon    Criterion

  1            3               7
  1           16               6
  1            6               5


TABLE 20
Assigning Spoons of Table 17 to Types Based on Ranks of One, Two, and
Three Combined

Type    Assigning spoon    Criterion

  1            3               9
  1            6               8


An alternative approach to granting the above assumption is to
investigate it, i.e., to proceed with larger numerical ranks. The
combined ranks of 1, 2, and 3 with the top criterion for them of
8 assign all objects to a single category as shown in Table 25.
Ranks of 1, 2, 3, and 4 combined act in the same fashion. Ranks
of 1, 2, 3, 4, and 5 combined yield a criterion of 9 and necessarily
assign all 10 cases (the assignment object and the 9 associates) to a
single category. Any further ranks would do likewise because they
too would yield a criterion of nine. In this particular set of data,
the assumption is substantiated.

When the above alternatives do not agree, the alternative which
classifies the most objects into types is accepted as the better.

The analysis proceeds in the above fashion until all objects have
been assigned to types (which cannot be further decomposed by the
method) and nonmembers. If more than two nonmembers are
generated throughout the entire analysis, they are assembled into a
matrix and the entire method is applied to them. The results of the
complete analysis are shown in Figure 2.


Results


The first level of classification assigned all 20 objects to one of
two types. The second level generated one nonmember
and assigned the other 19 objects to one of three types. The third
level generated four nonmembers and two entities and assigned the
other 13 objects to one of two types. The fourth level generated
three nonmembers and assigned the other 10 objects to types. The
fifth level of classification generated 10 entities; the two submatrices
from which they came did not generate other types.


TABLE 21
Assigning Spoons of Table 17 to Types Based on Ranks of One, Two,
Three, and Four Combined


TABLE 22
Ranks Within Columns Derived from Agreement Scores

                  Number of ranks of

Spoon        1      1 and 2      1, 2, and 3

   1         0         1              2
   3         5         6              8
   6         4         5              8
   9         1         4              5
  11         0         1              2
  13         0         0              1
  14         0         0              1
  16         4         6              6
  17         0         0              0
  19         0         1              2



Summary of Procedure 


A matrix of interassociations between objects is prepared. The
interassociations are ranked within columns, with the largest
association being assigned a rank of one, the next largest a rank of
two, the next largest a rank of three, etc. In the case of a tie such as
34, 33, 30, 30, 30, 30, and 29, the ranks would be 1, 2, 3, 3, 3, 3,
and 7 respectively; the highest rank (smallest numerical value)
represented by the tied values is assigned to all of them, and the
rank of the next lower association is based on the assumption that
the tied values used up as many ranks as there are tied values.
These steps yield a matrix of ranks within columns.

The next step is to count the number of ranks of one within each
row of the matrix and record it in a column labeled criterion for
ranks of one; if a row contains five ranks of one, then it has a
criterion of five for ranks of one.


TABLE 23
Assigning Spoons of Table 22 to Types Based on Ranks of One


TABLE 24
Assigning Spoons of Table 22 to Types Based on Ranks of One and Two

Type    Assigning spoon    Criterion

  1            3               6
  1           16               6
  1            6               5


An effort is made to assign all objects to types on the basis of 
the criteria for ranks of one. The largest criterion is used first. If 
more than one object (row) report the largest criterion, then all 
objects tied for the top criterion must be used one at & time in 
attempting the assignments of objects to types, and it is immaterial 
which object is used first, second, third, ete. An object with the 
highest criterion assigns itself and all objects with а rank of one in 
its row; they are assigned tentatively to Type 1. 

Assume that the object did not assign all other objects. Another 
object with the highest criterion (if there are ties) or an object with 
the next highest criterion (if there are no ties for the highest) assigns 
objects to a type based on a criterion of one. If there are overlaps 
between the objects of this type and Type 1, then all of the assign- 
ments are to Type 1. If there are no overlaps, the new assignments 
are to Type 2. 

The above steps complete the classification at the first level of as- 
signments if and only if every object is assigned to one of two types 
and there is not another object tied in criterion with the object 
just used in making assignments. : 

If the above steps do not complete the first level of classification, 
another object with the highest criterion is chosen (if there are 
three or more ties for the highest criterion) or an object with the 
second highest criterion (if there are two objects for either the first 
or second highest criterion) or an object with the third highest 


TABLE 25
Assigning Spoons of Table 22 to Types Based on Ranks of 1, 2, and 3
[Table body not recoverable from the source.]

[Figure 2 diagram: objects 1 to 20 inclusive, subdivided through Levels 1 to 5. Legend: types; non-members; entities; first hierarchical classification; classification of non-members of the first hierarchical classification.]

Figure 2. Reliable and valid hierarchical classification. 


criterion (if there are no ties for either first or second highest
criterion). It assigns itself and all objects with a rank of one in its
row to a type. If the new type has one or more overlaps in assignments
with but one other type, it becomes a part of the type with which it
has overlap.

If the new type has overlap with more than one type, it and all of the
types with which it has overlap are combined into a single type.

The above steps continue until all objects have been assigned to
either one of two types or to one type.

In the case of the latter outcome, the rank on which the criterion
is based is increased by one and the count of the number of qualifying
ranks within rows is the number of ranks of z (2 in the present
case) and numerically less.

The process repeats the above steps until either all objects are 
assigned to one of two types or a criterion of m-1 has been reached, 
where m is the number of objects in the matrix being subdivided; a
criterion of m-1 assigns its object and the m-1 other objects to a
single type. 

When a criterion of m-1 has been reached, the process reverts 
to the rank and criterion which classified most objects into types 
(preferably without classifying all of them into one category); objects
not classified are categorized as nonmembers. If there is a
tie for the minimum number of nonmembers, the classification by 


with a criterion of z for ranks of y and higher. 

When the classification of the original matrix has been completed, 
the steps are repeated with the submatrices (types and pseudotypes) 
derived from the original matrix and then the successive submatrices
until at the bottom level every object stands alone as an entity. 

All nonmembers produced at other than their respective bottom
levels of classification (i.e., all nonmembers which are not entities)
are assembled into a matrix and the process is repeated on them.

If in the analysis of nonmembers, more than two nonmembers
are again generated, the process is repeated until all possible non-
members are classified. 


Summary 

Pure types exist only in theory and are free from characteristics 
which do not conform to typal memberships. Real types, on the
other hand, exist in nature; they approach but frequently do not 
correspond perfectly to pure types. 

Insofar as real types do not conform to pure types, they can be 
characterized as spotted. Both the number of spots and the extent 
of discrepancy of each spot from its pure correlate can be deter- 
mined. 

"Mental patients" probably reflect more spots than "normals,"
and their spots are probably more discrepant than those of "normals."
"Mental patients" probably reflect these differences more clearly in
matrices reporting interassociations between characteristics of the
single individual than in matrices reporting interassociations
between persons.

In a matrix of interassociations between the members of pure types,
every index of interassociation reflects valid typal membership. In a
matrix of interassociations between members of real types, on the
other hand, only the higher indices generally reflect typal
relationships validly. Consequently, in the isolation of real types by
the method of this paper, only the larger indices of interassociation
are used. More specifically, the indices are assigned rank orders
within columns of a matrix. A solution is attempted first in terms of
ranks of one exclusively, then one and two exclusively, then one, two,
and three exclusively, etc. The analysis at each level of hierarchical
classification concludes with the earliest and, thus, the most valid
solution. However, a continuance until eo arise gives additional
objective evidence related to validity and further objective evidence
of the similarity of the real types to pure types.
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SYSTEMATIC SCORING OF RANKED DISTRACTORS FOR 
THE ASSESSMENT OF PIAGETIAN REASONING LEVELS


DAVID H. FELDMAN AND WINSTON MARKWALDER
University of Minnesota 


Guttman and Schlesinger (1967) recommended that item distractors
be systematically constructed to increase the information
yield of test devices. In describing the typical purpose of distractors,
Guttman and Schlesinger wrote: “Keeping the correct answer company
is usually regarded as the only function of distractors, and
sufficient attraction is deemed sufficient qualification for being a good
distractor” (p. 569). At least three desirable additional benefits can
be derived from distractors, according to Guttman and Schlesinger:

1. Successful prediction of relative empirical difficulties of distractors

2. Reduction of variation in test results due to undesired factors

3. Possibility of differential scoring of subjects on the types of
wrong answers to which they are attracted

The research reported by Guttman and Schlesinger (1967) provided
data bearing on the first two propositions; the research reported
in this paper bears on the third. Systematic scoring of ranked
distractors was used in this study to diagnose types of wrong answers
as they reflect different Piagetian reasoning levels. Specifically, the
purposes of the study were threefold: first, to determine if a test
of achievement in a specific content area could also be used to
diagnose the level of thinking at which a child tends to respond; second,
to compare a “reasoning level” analysis with a more conventional
index of performance; and third, to test an assumption of Piagetian
theory as it pertains to a specific school task. The assumption is
that reasoning development is attained in a fixed sequence of stages
by all children, although not necessarily at the same rate.

[Footnote, partly illegible in the source: "... and thank the principals ... of ... schools in ..., California and ... in the study. This report is ..."]

A new spatial reasoning (map reading) task was designed to
gather data on the efficacy of a ranked distractor instrument for
educational diagnosis. The task was based on a conceptual analysis
of the geographic map which attempted to specify the skills requisite
to proper map understanding (Salomon, 1968; Salomon and
Feldman, 1969); the technique was somewhat similar to that used
by Gagné (1962) for analysis of skills requisite to long division.
The test consisted of 25 items constructed to assess skills requisite
to map drawing; the items designed to measure these skills were
ordered into a hypothesized fixed sequence of acquisition. Previous
studies (Feldman, 1969, 1970) indicated that performance on these
25 items was positively related (r = .46, df = 268, p < .01) to a
map-drawing criterion, thus giving some measure of concurrent
validity to the map-reading test. Other validation techniques
adapted from Eisner (1967) and Piaget and Inhelder (1967) are
reported in detail elsewhere (Feldman, 1969, 1970).

Of the 25 test items, 17 were multiple-choice with four distractors
and a blank space if S wished to write his own answer. The remaining
eight items required Ss to perform some reasoning operation
(such as indicating the four directions around a map) or to
write an answer (such as a rationale for picking a city as the capital
of a mapped island). All 25 items were designed to induce responses
at four reasoning levels based on Piaget’s theory of cognitive
development (Piaget, 1950; Piaget and Inhelder, 1967; Sullivan, 1967).
The reasoning levels were selected to correspond to the major stages
between six and 14 years. In the case of the multiple-choice items,
each of four distractors was designed to reflect (a) tautological or
imaginative reasoning, (b) perceptual/associative reasoning, (c)
concrete reasoning, or (d) formal reasoning. For the remaining eight
items, responses were evaluated on the basis of the above four
reasoning levels according to standardized procedures.

Two scores were calculated. A map reading score (MS) was computed
for each subject based on the number of items answered
‘correctly,’ i.e., in a manner similar to traditional scoring methods.
All formal responses were counted as correct, as were some 11
concrete responses which would be acceptable answers on a regular
geography test. To obtain a reasoning level (RL) measure, the item
distractors were assigned values of one through four in ascending
order of development, and a reasoning level mean was computed
for each subject with the following formula:

\[
RL = \frac{(\text{Imaginary/Tautological} \times 1) + (\text{Perceptual/Associative} \times 2) + (\text{Concrete} \times 3) + (\text{Formal} \times 4)}{25\ (\text{Number of Items})}
\]


RL was compared with MS for reliability and relationship to other 
variables. 


Hypotheses 

Hypothesis 1 predicted that RL increases as grade level increases.
This hypothesis was intended to test the validity of RL as a measure
of cognitive development level. In previous studies (Feldman, 1969,
1970), MS was found to increase significantly with grade level;
because MS, i.e., achievement, was supposed to be dependent upon
increased cognitive development, RL should also increase with age.

Hypothesis 2 followed from the work of Turiel (1966, 1969)
and others on stages of moral reasoning development. Turiel and
his colleagues have found that responses to moral questions tend
to be distributed in an almost Gaussian fashion, with most responses
being at a single stage, fewer responses at stages ± 1 stage from
the modal level, and still fewer responses more than ± 1 stage from the
modal level (see Figure 1). It was predicted that RL responses
would also tend to exhibit a modal dominant response stage, with fewer
responses more than ± 1 stage from the modal stage. Positive
results for Hypothesis 2 would support RL as a valid measure of
cognitive development, since Piaget (1950) argues for the underlying
unity of all cognitive development. Thus, findings for spatial
reasoning and moral reasoning should be related to general cognitive
development stages in much the same way.
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Figure 1. Profile of moral stage usage on Kohlberg moral judgment inter- 
view (from Rest, Turiel, and Kohlberg study; taken from Turiel, 1969). 


Hypothesis 3 predicted that correlations of MS and RL with 
variables such as IQ, map drawing, sex, and ethnicity would not 
differ significantly. To an undetermined extent, RL and MS were 
artifactually related since they were computed from the same data. 
MS and RL for a given individual would have to be similar since 
those responses counted as correct were either formal or concrete; 
the very responses which would also contribute to a high RL. Still,
MS and RL could and did vary in individual instances, as in the 
following hypothetical example: 


S1: MS = 15    15 Formal     × 4 points = 60
               10 Concrete   × 3 points = 30
                0 Perceptual × 2 points =  0
                0 Imaginary  × 1 point  =  0
               RL = 3.60 = 90/25

S2: MS = 15     4 Formal     × 4 points = 16
               11 Concrete   × 3 points = 33
                0 Perceptual × 2 points =  0
               10 Imaginary  × 1 point  = 10
               RL = 2.36 = 59/25
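
The arithmetic of the two hypothetical cases can be checked with a short sketch. The function name `reasoning_level` and the dictionary layout are illustrative assumptions, not part of the original study; only the weights (1 to 4) and the divisor (25 items) come from the formula given earlier.

```python
# Weights assigned to each distractor type, in ascending order of development.
WEIGHTS = {"imaginary": 1, "perceptual": 2, "concrete": 3, "formal": 4}


def reasoning_level(counts, n_items=25):
    """Return the weighted mean reasoning level for one subject.

    counts maps each response type to the number of items answered
    at that level; the counts must add up to n_items.
    """
    if sum(counts.values()) != n_items:
        raise ValueError("response counts must sum to the number of items")
    total_points = sum(WEIGHTS[level] * n for level, n in counts.items())
    return total_points / n_items


# The two hypothetical subjects from the text (both have MS = 15).
s1 = {"formal": 15, "concrete": 10, "perceptual": 0, "imaginary": 0}
s2 = {"formal": 4, "concrete": 11, "perceptual": 0, "imaginary": 10}

print(reasoning_level(s1))  # 90 / 25
print(reasoning_level(s2))  # 59 / 25
```

The sketch only reproduces the RL computation; which concrete responses count toward MS depends on the item-level scoring key, which is not given here.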


Hypothesis 4 predicted that despite possible differences in achievement
and reasoning levels according to ethnic background, all groups
could be shown to be proceeding toward formal thought
through the same set of stages. Previous studies (Feldman, 1969,
1970) had found that when all 25 items of the map test were
subjected to scalogram analyses, each of three ethnic groups’
performance tended to form a scalable item set, but the items were
acquired in a somewhat different order within each group. Previous
research on sequences of cognitive development has supported
Piaget’s claim that stages of cognitive development are invariant
(Kohlberg, 1968; Wallach, 1963). Feldman (1970), in reviewing
the literature and in trying to explain his disparate results,
concluded that individual differences are likely to affect stages of
acquisition of reasoning skills when there are many specific tasks to
test a limited span of development. Therefore, Hypothesis 4
predicted that a reanalysis of Feldman’s (1970) 25-step sequence in
terms of Piaget's major stages would yield a more invariant
sequence.

Method 
Description of Subjects and Sample 


Two samples, chosen for different purposes, were the sources of
data for the study. A sample consisting of 270 fifth, seventh, and
ninth grade public school students of equivalent social class (lower
working class) but differing in ethnicity was drawn to test the
effects of ethnic differences and grade level on reasoning levels and
sequences. Subjects were distributed evenly across three ethnic
groups (Black, White, and Chinese) and the three grade levels.
This sample was drawn from four public schools in San Francisco,
California (for more details of the sample and sampling techniques,
see Feldman, 1969, 1970). To gather data on the stability of the
measurements, fourth and sixth grade students (N = 88) attending
a St. Paul, Minnesota elementary school were sampled. This sample
was chosen for its wide SES range and middle class bias, i.e.,
because it was more representative of typical school populations than
the San Francisco sample.


Procedures for Administration

A primary purpose of the testing procedure was to reduce, within
practical constraints, the dependence of a child's performance upon
his ability to read. Another purpose was to reduce the anxiety of
test taking to a minimum so that each child had the best
opportunity to exhibit his reasoning about spatial concepts.

All Ss were tested in the classroom by the same examiner (E)
accompanied by one or two assistants. For most groups, E was of
the same ethnic background as the Ss. After a brief informal warm-up
to establish rapport, E read all directions, each question, and its
distractors aloud, calling attention to the option of S to write in
his own answer, and proceeding as slowly as was necessary to insure
each child the opportunity to think about and complete the items.
E and the assistant(s) answered all questions by Ss on a one-to-one
basis when Ss indicated that they did not understand an item.
Students were given individual assistance in expressing their own
responses, and some who were obviously handicapped by inability
to write with facility dictated oral responses to E or an assistant,
who then wrote down the response. All groups were told that they
were “testing the test”; this idea seemed to appeal to them. On the
whole, Ss seemed to find the items interesting and appeared to be
motivated to select the responses which seemed “best to them.”



Instrument Validity and Reliability 


The validity of the instrument was tested against a map-drawing
criterion in the San Francisco sample. The correlation between map
drawing skill and map reading scores was found to be .46 (p < .01)
for the entire sample; sub-group correlations did not significantly
differ from the sample. Thus, it may be said that the instrument
is a reasonably good predictor of a map-drawing criterion.
The restricted variability in map categories (six categories) may
have affected the map category × map reading score correlation
(see Feldman, 1969, 1970 for more details).

In the St. Paul sample, test-retest reliability coefficients of .74
for map reading scores (MS) and .81 for reasoning level (RL)
were obtained. Practice effects appeared to be insignificant; a mean
map score of 16.85 for two classes which had not taken the test in
October (N = 48) compares closely with the retest map score
mean of 17.03. Thus, the increase from 15.52 in map score mean in
October to 17.03 for the retest in January can probably be attributed
to causes other than practice effects (e.g., motivation, school
experience).
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TABLE 1 
Map Score Means and Standard Deviations for the San Francisco Sample
(N = 270)

                          Grade Level
Ethnic Group          5th       7th       9th      Total
Black      X̄        11.97     13.40     15.17     13.51
           SD       (3.81)    (3.37)    (3.51)    (3.57)
White      X̄        14.20     16.17     17.20     15.85
           SD       (3.57)    (3.15)    (3.67)    (3.47)
Chinese    X̄        14.57     17.40     17.63     16.53
           SD       (3.07)    (3.43)    (3.39)    (3.30)
Total      X̄        13.53     15.63     16.63     15.30
           SD       (3.50)    (3.32)    (3.52)    (3.86)


Results 
Hypothesis 1 


Hypothesis 1 predicted that RL would increase significantly with
grade level. Table 1 shows MS means and SDs for the San Francisco
sample (N = 270). Table 2 shows RL means and SDs for the same
sample. Previous results (Feldman, 1970) have shown grade level to
be a significant (p < .01) influence on MS; Table 3 presents an
analysis of variance testing the effects of grade level on RL. As can
be seen from Table 3, grade level significantly influenced RL
(p < .01), supporting Hypothesis 1.


Hypothesis 2 


Hypothesis 2 predicted that Ss would respond most frequently 
to distractors at the reasoning level hypothesized to be the S's 


TABLE 2
Reasoning Level Means and Standard Deviations for the San Francisco Sample
(N = 270)

                          Grade Level
Ethnic Group          5th       7th       9th      Total
Black      X̄         2.65      2.67      2.94      2.75
Chinese    X̄          ...       ...       ...      3.07
White      X̄          ...       ...      3.16      2.99
Total      X̄         2.81      2.92      3.07      2.94

SD entries legible in the source (column positions uncertain): Black 0.33, 0.36, 0.34; Chinese 0.29, 0.33; White 0.32, 0.35, 0.33. Entries shown as "..." are illegible in the source.


TABLE 3
Analysis of Variance Testing Effects of Grade Level and Ethnic Group on
Reasoning Level Means (N = 270)

Source           df     MS       F       p
Grade Level       2    2.37    14.81   <.01
Ethnic Group      2    1.63    21.55   <.01
GL x EG           4    0.16     1.45    NS
Error           261    0.11


dominant stage, less frequently ± 1 reasoning level away from the
dominant stage, still less frequently more than ± 1 category away.

Figure 2. Profile of cognitive stage responses on Salomon/Feldman (... [horizontal axis: stage level relative to dominant stage, -2 to +2]

Figure 2 presents an analysis of cognitive stages which parallels
Turiel's method of determining the profile of responses relative to
a dominant stage. Ss were categorized into dominant cognitive
stages on the basis of the following decision rule: Ss who had RLs
of 1.00 to 1.75 were hypothesized as tending to respond at a
tautological level; RLs of 1.76 to 2.50 at a perceptual level; RLs of 2.51
to 3.00 at a concrete level; RLs of 3.01 to 4.00 were considered to be
responding formally (see Figure 3). These response ranges were
selected somewhat intuitively; decisions were made on the basis of
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previous findings and best guesses (Feldman, 1969, 1970). On the 
basis of the categorizing method described above, no subjects fell 
into the Tautological stage, 33 fell into the Perceptual stage, 109 
into the Concrete stage, and 128 into the Formal reasoning cate- 
gory. Although the resulting distribution of responses is flatter than 
that found by Rest, Turiel, and Kohlberg (1969), and although 
a restricted number of categories (4 versus 6) may have distorted 
the distribution somewhat, the data tend to support the prediction; 
i.e., responses did tend to fall into the hypothesized dominant
response category. Obviously, one could adjust the RL ranges to 
produce increasingly better fits to the predicted distribution; how- 
ever, only one algorithm was used in this study. 
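
As an illustration of the categorizing rule, the sketch below maps an RL mean to a dominant stage and tabulates a subject's responses relative to that stage, in the manner of the profile analysis. The function names and data layout are hypothetical; the cut-offs are the RL ranges quoted in the text, implemented as upper bounds (so a value falling between two printed ranges, such as 1.755, is assigned to the higher stage, which is an assumption).

```python
# Upper RL bound of each dominant stage, in developmental order.
STAGE_UPPER_BOUNDS = [
    (1.75, "tautological"),
    (2.50, "perceptual"),
    (3.00, "concrete"),
    (4.00, "formal"),
]
STAGE_ORDER = [name for _, name in STAGE_UPPER_BOUNDS]


def dominant_stage(rl):
    """Return the dominant stage name for a reasoning level mean."""
    if not 1.0 <= rl <= 4.0:
        raise ValueError("RL must lie between 1.00 and 4.00")
    for upper, name in STAGE_UPPER_BOUNDS:
        if rl <= upper:
            return name


def profile_relative_to_dominant(counts, rl):
    """Proportion of responses at each offset from the dominant stage.

    counts maps stage names to numbers of responses; the result maps
    offsets (-3 .. +3) to proportions of all responses.
    """
    dominant_index = STAGE_ORDER.index(dominant_stage(rl))
    total = sum(counts.values())
    return {
        STAGE_ORDER.index(name) - dominant_index: n / total
        for name, n in counts.items()
    }


# Hypothetical subject: 15 formal and 10 concrete responses, RL = 3.60.
counts = {"tautological": 0, "perceptual": 0, "concrete": 10, "formal": 15}
print(dominant_stage(3.60))
print(profile_relative_to_dominant(counts, 3.60))
```

Note that a subject can be placed in a stage at which he gave no responses at all: an RL of 2.36 obtained from formal, concrete, and imaginary responses falls in the perceptual range under this rule.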

It should be noted that only at the Formal level is there 
necessarily an artifactual relationship between RL and the modal 
response to distractor types. As was illustrated above in two hypo- 
thetical cases, a given RL could be achieved with a variety of pat- 
terns of responses. Thus, it appears as if RL categorizing produced 
results consistent with those of previous research (Turiel, 1969; 
Rest, Turiel and Kohlberg, 1969), and also consistent with the 
cognitive-developmental theory of Piaget. 


Hypothesis 3


Table 4 presents a correlation matrix of MS, RL, and five other
variables. Table 4 shows that MS and RL correlated .93. It was
predicted that MS and RL would correlate with sex, IQ, and grade
level to about the same extent. In view of the high correlation
between MS and RL, it was almost inevitable that the two measures
would exhibit a similar pattern of intercorrelations with other
variables. Thus, although Hypothesis 3 was supported, no clear
indication of the extent to which map achievement (MS) and reasoning
level are related to other variables will be available until
MS and RL are assessed with different instruments. The only indication
from previous research of the relationship between map reading
and reasoning level (Salomon, 1968) is a .53 (p < .01) correlation
between a variation of the map test used in the present study
and a Piagetian spatial reasoning task (objects-on-a-slope). As
reported above, RL was slightly more reliable over a three month
interval than MS (.81 vs. .74).

Figure 3. Distribution of responses to map test items for subjects ... [vertical axis: proportion of responses; horizontal axis: Imaginary, Perceptual, Concrete, Formal]
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TABLE 4 


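Correlation Matrix of Map Score and Reasoning Level with Other Variables
(San Francisco Sample, N = 270)

                 Grade              Ethnic               Reasoning
                 Level     Sex      Group       IQ         Level
Grade Level       --       .05       .00       .23*        .41**
Sex                        --       -.07       .07         .02
Ethnic Group                         --        .22*        ...
IQ                                             --          .58**

[Row labels are inferred from the triangular layout; "..." marks an entry illegible in the source, and the Map Score row and column did not survive.]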

Hypothesis 4 predicted that despite differences among ethnic
groups in MS and RL (both significant), all groups go through the
same set of developmental stages. Figure 4 shows the number of
subjects at each reasoning level, with the sample analyzed first by
grade level, then by ethnic group. As seen in Figure 4, 5th grade
children tend to respond at a concrete level of reasoning, with smaller
numbers responding predominantly perceptually and formally. At
7th grade, the number of perceptual respondents decreased slightly,
the number of concrete respondents also decreased slightly, while
formal respondents increased from 31 to 41 (out of 90 possible).
At 9th grade, the number of Ss responding at perceptual and concrete
levels continued to decrease, while 56 of 90 Ss tended to respond
at formal reasoning levels. Thus, the predicted developmental
changes with increased age were supported by the data, at
least for the sample taken as a whole.

... children were distributed across reasoning stages ...
identical to 9th grade children. (It should be noted that the same
sample was analyzed both for grade level and ethnic group differences;
thus one third of the same subjects were in both sets of
frequencies.)

Figure 4. Number of subjects at each reasoning level stage analyzed by
grade level and by ethnic group.


From the similarity of distributions for each of the three ethnic 
groups to each of the three grade levels, it would appear that
Hypothesis 4 was supported. It appears that the differences between
ethnic groups are not due to fundamental differences in cognitive 
developmental stages themselves, but rather due to the age by 
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which the members of each group have achieved certain reasoning 
skills. 
Discussion 

The purpose of this study was to test the efficacy of using ranked 
distractors for the assessment of Piagetian reasoning levels. Since
the data tended to support the hypotheses, and since the stability
of the instrument was relatively high, it would appear that the map
instrument may be capable of measuring reasoning stage levels as
well as map achievement. 

If, as the results indicate, levels of reasoning do increase with 
grade level, and if the rate of stage to stage development varies, 
the diagnosis of individuals’ developmental levels using ranked dis- 
tractors would appear to have potential importance for educators 
(Adler, 1963; Stone, 1966; Sullivan, 1967; Feldman, 1969, 1970). 
While a great deal of work has been done attempting to replicate
Piaget's results, particularly in areas such as conservation and object
permanence, little seems to have been accomplished in setting
up standardized batteries for diagnostic purposes. The level of a
student's or group's cognitive functioning is obviously of importance
to the teacher in planning curricula, instructional strategies,
manipulating the educational environment, etc. Piaget's clinical
methods are impractical because they require long periods of concentrated
observation by trained observers, coupled with the use of
non-standardized questioning (Ginsburg and Opper, 1969). As a
pencil-and-paper measure of reasoning level, the Feldman-Salomon
technique of using ranked distractors has the advantage
that it may be utilized in ordinary classroom situations; it requires
a reasonable amount of time (60 to 90 minutes) and does not
require a highly trained examiner.

Despite this seeming potential significance as a diagnostic instrument
applied to the classroom situation, it is as a heuristic device
that this measuring technique seems to offer the most promise at
this time. Piaget has provided a wealth of material for hypothesis
construction and experimentation (Flavell, 1963), but there is also,
as Ausubel (1967) pointed out, some confusion among educators
and students who are “bewildered by the overstated [...]
Piaget’s supporters and detractors regarding the applicability [...]
ideas to educational theory and practice.” Thus, while the
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exploration of practical applications of research on sequences of 
stage development theories should continue, there is a danger that 
these applications may be premature extrapolations (Sullivan, 
1967, p. 17): 


One note of caution should be sounded in the treatment of 
Piaget’s observations as a structural theory of intelligence. Pia- 
get’s theoretical model of intellectual development is a rather 
elaborate superstructure which is superimposed on his observa- 
tions. More research is needed to clarify how his ‘structures’ 
account for the observations that he has catalogued. 


Thus, our research will continue to seek data bearing on the fol- 
lowing questions: 


1. What mechanisms affect cognitive development? 
2. How do these mechanisms operate? 


It would seem as if these questions must be understood if educators 
are to optimize progress through stage development. Four possible 
mechanisms have been suggested (Piaget, 1964; Sullivan, 1967)
which, singly or combined, could account for Piaget’s observations 
on stage development; they are: 


1. maturation, 

2. interaction with the physical environment, 
3. social interaction with peers, 

4. equilibration. 


The results reported here suggest that an examination of the 
types of responses chosen or volunteered by children at differing 
levels of development may provide insights into the operation of 
general mechanisms of cognitive development. Turiel (1966, 
1969) and his colleagues have begun to explore this approach in 
their studies of moral development. Item response analysis takes 
on a new dimension in this context, since each response potentially 
reflects a specific level of reasoning. A battery of such instruments 
could be useful in establishing conditions for influencing interstage 
transitions, making comparative analyses of before-and-after treat- 
ment effects, and in providing data necessary for operationally de- 
fining development of cognitive processes. 

It should be noted that the notion of “stage” is itself a heuristic 
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device to describe qualitative differences in reasoning. Cognitive 
development almost certainly does not take place in disjunct stages. 
As data from Rest, Turiel and Kohlberg (1969) and the present 
study indicate, responses tend to cluster about a modal stage but 
do not all fall in that stage. Researchers and educators should be 
cautioned not to categorize an individual into a given stage and 
assume he can respond only at that level. In view of the complexity
of Piagetian theory, and in view of the difficulty in render-
ing cognitive development researchable, the results found in this 
study should be construed only as observed behavioral differences 
which may or may not reflect developmental processes. 
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NOTES ON APPROXIMATE PROCRUSTES 
ROTATION TO PRIMARY PATTERN


ESKO KALIMO 
The National Pensions Institute of Finland 


An increasingly popular trend in confirmatory factor analysis is
rotating to a maximum fit to a specified factor matrix. This kind of
rotation to a target matrix, instead of the more usual criterion to a
maximum simple structure, has been called the “Procrustes” approach.
The problem was first considered by Mosier (1939), who presented
a mathematical rationale for oblique rotation to a target matrix,
expressed as a reference structure. Because he found that the
equations were difficult to solve algebraically, he suggested an
approximate solution for rotating a factor matrix to the best least
squares fit to the target reference structure. The same approximate
solution was later suggested by Horst (1956) and Hurley and Cattell
(1962). The authors of the last-mentioned article relied partly on the
work of Ahmavaara (1954), who used the same formulas for a closely
related transformation: for comparing two factor analyses with each
other. Lubin (1950) proposed essentially the same method to be used
with Eysenck's “criterion analysis” (1950). Harman (1967, p. 251)
refers to this procedure as a general method giving the matrix of
transformation between any two solutions in the same common factor
space.

This approximate solution for a maximum fit to a target matrix
has been applied to target matrices expressed as reference structures.
Since it is often conceptually and computationally simpler to use
primary patterns, the procedures dealing with reference structure have
made additional computations necessary to get the final matrix as a
primary pattern (see e.g., Hendrickson and White, 1964). A direct
procedure to a primary pattern, when the target matrix is a reference
structure, will be suggested. Secondly, it will be shown that the
rotated primary patterns are not usually identical when the target
matrix is a reference structure and when it is the corresponding
primary pattern.

It may be noted that, under the restriction of orthogonal rotation, 
exact solutions for Procrustes rotation have been given by Green 
(1952), Cliff (1966), and Schönemann (1966). Browne (1967) suggested
an exact solution for the Procrustes rotation to oblique reference
structure. Fischer and Roppert (1964), Browne and Kristof
(1969), and Gruvaeus (1970) have developed exact Procrustes methods
to oblique primary pattern, by minimizing slightly different
criteria than Mosier (1939). In spite of these recent developments, it
seems to be well-founded to examine further the properties of the
approximate oblique procedure, especially because of its simplicity.
Neither can this approximate method be abandoned before more
research has been done on the superiority of the exact methods in
practice.


Procedures 
Consider an unrotated orthogonal factor matrix for n variables and
m factors, $A$ ($n \times m$), which we would like to rotate to a maximum
fit to a target matrix, $B$ (usually $n \times m$). Let $L$ ($m \times m$) stand for
the desired transformation matrix and $E$ ($n \times m$) for the matrix of
differences between the rotated matrix and the target matrix, represented by

\[ E = AL - B. \tag{1} \]

The sum of squares, $\operatorname{tr}(EE')$, is an appropriate criterion to be
minimized. As Mosier showed, the transformation matrix which gives
a least squares fit of matrix $A$ to matrix $B$ is given by

\[ L = (A'A)^{-1}A'B. \tag{2} \]

When the target matrix is a reference structure, $V$ ($n \times m$), and if

— 
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we require that the factors have unit variances, matrix L must be 
normalized by columns to give the final transformation matrix, A 
(m x m), which would satisfy the restriction 


diag (M A) = I. (3) 
Then 
AA = V. (4) 
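
The computation in equations (2) through (4) can be sketched numerically. The following is a minimal illustration, assuming NumPy; the function name `approximate_procrustes` and the small matrices `A` and `B` are hypothetical and are not taken from the article.

```python
import numpy as np

def approximate_procrustes(A, B):
    """Least-squares transformation L of eq. (2) and its column-normalized form."""
    # L = (A'A)^{-1} A'B, solved without forming the inverse explicitly
    L = np.linalg.solve(A.T @ A, A.T @ B)
    # Scale each column of L to unit length so that diag(Lam' Lam) = I, eq. (3)
    Lam = L / np.linalg.norm(L, axis=0)
    return L, Lam

# Hypothetical unrotated loadings (5 variables, 2 factors) and target matrix
A = np.array([[0.8, 0.3], [0.7, 0.4], [0.6, -0.5], [0.5, -0.6], [0.7, 0.1]])
B = np.array([[0.6, 0.0], [0.6, 0.0], [0.0, 0.7], [0.0, 0.7], [0.4, 0.2]])

L, Lam = approximate_procrustes(A, B)
V_rotated = A @ Lam  # rotated reference structure, eq. (4)
```

Solving the normal equations with `np.linalg.solve` rather than inverting A'A is a numerical convenience only; it yields the same L as equation (2).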


By standard techniques (Harman, 1967), matrix V can be trans-
formed to the corresponding primary pattern and the correlations
among the primary factors calculated. The well-known relation be-
tween the unrotated factor matrix and a primary pattern, P (n × m),
is expressed by

$$AT^{-1} = P, \tag{5}$$

in which $T^{-1}$ (m × m) refers to the transformation matrix. Its inverse
T (m × m) gives the direction cosines of the primary factors with re-
spect to the unrotated factors and is subject to the following restric-
tion, in the usual case of factors with unit variances,

$$\operatorname{diag}(TT') = I. \tag{6}$$

The well-known relation between matrices T and Λ is

$$T = D_1\Lambda^{-1}, \tag{7}$$

in which $D_1$ (m × m) refers to a diagonal matrix with elements for
normalizing matrix $\Lambda^{-1}$ by rows. These elements are also the correla-
tions between the corresponding rotated primary and reference fac-
tors. When equation (7) is expressed in terms of matrix L, we get

$$T = D_1(LD_2)^{-1}, \tag{8}$$

in which $D_2$ (m × m) refers to a diagonal matrix with elements for
normalizing matrix L by columns. This can be written

$$T = D_1D_2^{-1}L^{-1}. \tag{9}$$

Because the elements of matrix $D_1$ are always chosen so that ma-
trix $D_2^{-1}L^{-1}$ becomes normalized by rows, matrix T can be directly
calculated by

$$T = D_3L^{-1}, \tag{10}$$

in which $D_3$ (m × m) refers to a diagonal matrix with elements for
normalizing matrix $L^{-1}$ by rows. The final transformation matrix, $T^{-1}$,
is given by

$$T^{-1} = LD_3^{-1}. \tag{11}$$


This formula shows a direct procedure for calculating the best
fitting primary pattern, when the target matrix is expressed as a ref-
erence structure. The matrix of transformation to the best fitting
primary pattern can thus be obtained directly by another simple mod-
ification of matrix L, as is the matrix of transformation to the corre-
sponding reference structure. It is not necessary to proceed through
the reference structure to be able to express the results as a primary
pattern, for this method gives a primary pattern which is identical to
that which would be obtained through the reference structure. So, if
A is first rotated to a reference structure, using (2) and (4), it can
later be rotated to the corresponding primary pattern either by ap-
plying the Procrustes method again, i.e., formulas (11) and (5), or by
a transformation of the rotated reference structure using the standard
technique.
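
The equivalence of the two routes described above can be checked numerically. The sketch below assumes NumPy; the function names and the matrices `A` and `B` are hypothetical illustrations. The direct route normalizes the rows of the inverse of L, as in (10) and (11); the indirect route first forms the rotated reference structure by (3) and (4) and then rescales its columns by the diagonal normalizing matrix of (7).

```python
import numpy as np

def primary_pattern_direct(A, L):
    """Primary pattern from L via eqs. (10), (11), and (5)."""
    L_inv = np.linalg.inv(L)
    row_len = np.linalg.norm(L_inv, axis=1)   # lengths of the rows of L^{-1}
    T = L_inv / row_len[:, None]              # eq. (10): rows of T have unit length
    T_inv = L * row_len                       # eq. (11): column j of L times row_len[j]
    return A @ T_inv, T                       # eq. (5): P = A T^{-1}

def primary_pattern_via_reference(A, L):
    """The same pattern obtained through the rotated reference structure."""
    Lam = L / np.linalg.norm(L, axis=0)       # column-normalized L, eq. (3)
    V = A @ Lam                               # eq. (4)
    # Row lengths of Lam^{-1}; their reciprocals form the diagonal matrix of eq. (7)
    d_inv = np.linalg.norm(np.linalg.inv(Lam), axis=1)
    return V * d_inv                          # rescale column j of V by d_inv[j]

# Hypothetical unrotated loadings and target reference structure
A = np.array([[0.8, 0.3], [0.7, 0.4], [0.6, -0.5], [0.5, -0.6], [0.7, 0.1]])
B = np.array([[0.6, 0.0], [0.6, 0.0], [0.0, 0.7], [0.0, 0.7], [0.4, 0.2]])

L = np.linalg.solve(A.T @ A, A.T @ B)         # eq. (2)
P_direct, T = primary_pattern_direct(A, L)
P_via_V = primary_pattern_via_reference(A, L)
```

With these inputs the two patterns agree to rounding error, which is the identity the text asserts.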

The above result may also be interpreted so that, when the approx-
imate Procrustes procedure is applied with a reference structure as
the target, both the rotated reference structure and the correspond- 
ing rotated primary pattern are in the same sense maximally similar 
to the target matrix. It can be shown in a corresponding way that this 
is true also when we have a primary pattern as the target matrix, 
assuming that matrix L is again modified in the same ways, for pro- 
ducing the final transformation matrices which fulfill the restrictions 
imposed on them by the factor analysis model. However, when the 
exact Procrustes methods are used, the rotated reference structure 
and the corresponding primary pattern are not always in the same 
sense maximally similar to the target matrix (Browne and Kristof, 
1969). 

It can be shown, however, that when applying the approximate
Procrustes method, the rotated primary pattern which is maximally
similar to a target reference structure is not identical with the ro-
tated primary pattern which is maximally similar to the correspond-
ing target primary pattern.

When the target matrix is a reference structure, the matrix of trans-
formation to a primary pattern can be written according to (9) and
(11)

$$T^{-1} = LD_2D_1^{-1}. \tag{12}$$

The well-known relation between a reference structure and the cor-
responding primary pattern is

$$V = PD_4, \tag{13}$$

in which $D_4$ refers to a diagonal matrix giving the correlations be-
tween the corresponding reference and primary factors. According
to (2) and (13), (12) can be written

$$T^{-1} = (A'A)^{-1}A'PD_4D_2D_1^{-1}. \tag{14}$$

It may be noted that $D_4$ refers to the correlations between the tar-
get reference and primary factors, while $D_1$ gives the correlations
between the rotated reference and primary factors.

When the target matrix is a primary pattern, the matrix of trans-
formation to the rotated primary pattern is according to (11)

$$T_p^{-1} = LD_3^{-1}. \tag{15}$$

$D_3^{-1}$ in (15) is not usually equal to $D_2D_1^{-1}$ in (12) and (14), because
they refer to normalizations of different matrices $L^{-1}$. By using (2) we
get

$$T_p^{-1} = (A'A)^{-1}A'PD_3^{-1}. \tag{16}$$

According to formulas (14) and (16), matrices $T^{-1}$ and $T_p^{-1}$
are not usually identical. However, since the difference in their
formation can be expressed as a diagonal matrix, the correspond-
ing columns of $T^{-1}$ and $T_p^{-1}$ are proportional to each other. The re-
sulting primary patterns are thus different depending on the form
of the target matrix, but also their corresponding columns are pro-
portional to each other. These pattern matrices differ generally only
little from each other, leading in most cases to the same substantive
conclusions.

When the target matrix is an orthogonal matrix and if we do not 
impose the restriction to orthogonality on the rotated factors, the 
Procrustes rotation can be carried out by the formulas given above 
either to a reference structure or to the corresponding primary pat- 
tern, depending on the preference of the factor analyst. Applications 
of the approximate Procrustes rotation with orthogonal target ma-
trices are published elsewhere (Bice and Kalimo, in press).


Summary


After a brief review of rotation methods to a specified target ma-
trix in factor analysis, two notes were presented. The notes dealt with
the approximate Procrustes method, which rotates a factor matrix 
to an approximate least squares fit to a target matrix. The first note 
suggested a direct procedure for obtaining a primary pattern, when 
the target matrix is a reference structure. On the basis of this result 
it was also concluded that the resulting primary pattern and the cor- 
responding reference structure are in the same sense maximally sim- 
ilar to the target matrix. The second note showed that usually num- 
erically different but interpretatively similar primary patterns are 
obtained when the target matrix is a reference structure and when it 
is the corresponding primary pattern.


COMMUNALITY ESTIMATION IN FACTOR
ANALYSIS OF SMALL MATRICES


EDWARD E. CURETON 
University of Tennessee 


Nowadays factor analyses of medium to large matrices are done on
electronic digital computers. But small matrices can still be factored
efficiently by the complete centroid method on desk calculators. It is
not commonly noted, however, that the required accuracy in com-
munality estimation increases as the number of variables decreases.
Yet in centroid analysis, this fact is fairly obvious. In a small ma-
trix, the error in the estimate of a diagonal entry is a substantial
fraction of the column sum on which the factor loading is based. In
a large matrix it is not.

In all too many factor analyses of small matrices, the investigators 
take as initial estimates of communality the absolute value of the 
numerically highest correlation in each column (|r| max), and re-
peat this procedure with each residual matrix. In so doing they over-
look or ignore Thurstone's explicit warning (1947, p. 300), "This sim-
ple method of estimating communalities is useful only for large
correlation matrices. It is not applicable to small tables."

Of the more elaborate methods of communality estimation which 
are still within practical limits for desk calculators, the one most com- 
monly recommended is the “miniature centroid" method (Medland, 
1947; Thurstone, 1947, p. 300, eq. 15). Each variable is grouped with 
the two to four others with which its correlations are highest, the 
highest correlation in each column is placed on the diagonal, and the 
communality of the variable is taken as the square of its first centroid 
factor loading. With small matrices, this method also is unsatisfac-
tory. It assumes that each miniature centroid matrix will have rank
almost unity, but in small matrices it often happens that the two to
four variables with which a given variable correlates highest do not
form with it a submatrix of rank close to unity.

If two variables have vectors which are close together in the com-
mon-factor space, their correlation will be close to the geometric mean
of their communalities. The correlation will underestimate one com-
munality and overestimate the other. Cattell (1952, pp. 154-5) as-
cribes to Burt (1940) the suggestion that if the column sum for a
given variable is high, the highest-r estimate of communality is likely
to be too low, while if the column sum is low, the highest-r estimate
is likely to be too high. Cattell noted also that the dispersion of the
highest r's is usually less than that of the final computed communal-
ities. These observations are the basis of the method of initial com-
munality estimation proposed here.

If each highest r is weighted by the corresponding column sum and
divided by the sum of all column sums, the dispersion of the result-
ing estimates is about the same, on the average, as the dispersion of
the final computed communalities. But in some individual cases the
weighting correction goes in the wrong direction, and this weighting
system gives maximum discrepancies (between estimated and final
computed communalities) which in general are larger than those
given by the unweighted highest-r method. We need to weight the
highest r's by quantities proportional to the column sums, but with
less relative variability than that of the column sums themselves.

Two additional points need to be noted. First, if the correlation
matrix contains negative entries, even if not enough to require reflec-
tion, the column sums needed for weighting are the sums of the abso-
lute values of the correlations in the columns, not their algebraic
sums. A variable with a low algebraic sum but a high absolute sum
will have a low loading on the first factor, but high loadings on one
or more later factors, and hence a high communality. Second, it is
common lore that, on the average (over different factor analyses),
the sum of the highest r's gives a slight overestimate of the sum
of the final communalities. Some empirical evidence, presented
later herein, suggests that the total final communality is, on
the average, about 96 per cent of the sum of the highest r's. The
proposed procedure is, then, as follows:

1. In the correlation matrix with empty diagonal, find the absolute
column sums, Σ'|r|. (Σ' designates a sum of n − 1 quantities, where n
is the number of variables.)

2. Add to each Σ'|r| the mean of the n values of Σ'|r|, and call this
sum S.

3. Record the absolute value of the highest r in each column, |r|
max.

4. Form the products S|r| max.

5. The estimated communalities are then given by

$$C = S\,|r|_{\max}\left(\frac{.96\,\Sigma |r|_{\max}}{\Sigma S\,|r|_{\max}}\right). \tag{1}$$

Here Σ represents a sum of n quantities.

6. Estimate the reliabilities of the variables. If they are tests, and
no better estimates are available, use Lord's (1959) empirical esti-
mate of the standard error of measurement, from which

$$r_{tt} = 1 - .187k/s^2, \tag{2}$$

where k is the number of items and s² is the variance of the scores.
Then in any case in which the communality as estimated by (1)
exceeds the estimate of reliability, substitute the latter.
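
The six steps above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function names `initial_communalities` and `lord_reliability` and the 4 × 4 correlation matrix are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

def initial_communalities(R, reliabilities=None):
    """Initial communality estimates by steps 1 to 6 of the proposed procedure."""
    abs_r = np.abs(np.asarray(R, dtype=float))
    np.fill_diagonal(abs_r, 0.0)              # treat the diagonal as empty
    col_sums = abs_r.sum(axis=0)              # step 1: absolute column sums
    S = col_sums + col_sums.mean()            # step 2: add the mean column sum
    r_max = abs_r.max(axis=0)                 # step 3: highest |r| in each column
    products = S * r_max                      # step 4
    # step 5: rescale so the estimates sum to .96 of the sum of the highest r's
    C = products * (0.96 * r_max.sum() / products.sum())
    if reliabilities is not None:             # step 6: cap at the reliability
        C = np.minimum(C, np.asarray(reliabilities, dtype=float))
    return C

def lord_reliability(k, variance):
    """Reliability from the number of items k and the score variance, eq. (2)."""
    return 1.0 - 0.187 * k / variance

# Hypothetical correlation matrix with one negative entry
R = np.array([[1.0, 0.6, 0.5, 0.3],
              [0.6, 1.0, 0.4, 0.2],
              [0.5, 0.4, 1.0, -0.1],
              [0.3, 0.2, -0.1, 1.0]])

C = initial_communalities(R)
C_capped = initial_communalities(R, reliabilities=[0.6, 0.9, 0.9, 0.9])
```

Because of the rescaling in step 5, the uncapped estimates always sum to .96 times the sum of the highest r's, whatever the weights S happen to be.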


Empirical Verification 

In order to test the usefulness of (1), factor analyses of ten small
matrices were studied.

1. Harman (1967, pp. 88-9) gives intercorrelations among [number
illegible] hypothetical variables which yield exact two-factor communalities.

2. Harman (1967, pp. 147, 154) also gives intercorrelations among
eight physical measurements, and a principal-axes factor analysis to
two factors, starting with computed communalities from the original
study, which included 17 variables, by Mullen (1939). The largest
discrepancy between any estimated communality and the final com-
puted communality was .058.

3. Cureton et al. (1944) give intercorrelations among five verbal
tests. The writer factored them by the centroid method to two factors,
starting with communality estimates based on a tetrad-triad analy-
sis. The largest discrepancy between any estimated communality and
the final computed communality was .004.

4. Harman (1967, pp. 178, 186) gives intercorrelations among thir-
teen psychological tests, and a centroid factor analysis to three fac-
tors, starting with computed communalities from a previous bi-factor
solution which included 24 tests. The largest discrepancy between any
estimated communality and the final computed communality was
.445.

5. Lawley (in Thomson, 1951) gives intercorrelations among eight 
tests, and a maximum-likelihood factor analysis converging to exact 
sample communalities for two factors. The Lawley test did not reject 
the two-factor hypothesis at the .05 level, but did reject it at the .10 
level. The writer, therefore, extracted a third factor by the centroid 
method, using column means after reflection (see later discussion) as 
residual communality estimates. This third factor was of doubtful 
significance, with highest loading .206 and total added communality 
.116, but it was retained for purposes of comparison of communality
estimation methods, because with it the total communality was 94.8 
per cent of the sum of the highest r's, while without it the total com- 
munality was only 91.8 per cent of the sum of the highest r's. 

6. Lawley and Maxwell (1963, pp. 33-40) report intercorrelations 
among six sets of school grades, and a centroid factor analysis to two 
factors, repeated three additional times to improve the communality 
estimates. The Lawley test did not reject the two-factor hypothesis 
at the .50 level. 

7. Lawley and Maxwell (1963, pp. 17-20, 24-27) report also inter- 
correlations among nine variables, and maximum-likelihood solutions 
for two and three factors. The Lawley test rejected the two-factor 
hypothesis at the .10 level but just failed to reject it at the .05 level. 
The three-factor hypothesis was not rejected at the .80 level. The 
three-factor solution was therefore used for our comparisons. 

8. Swineford (1948) reports intercorrelations among nine tests, 
and gives a bi-factor solution. The writer factored this matrix to three 
factors by the centroid method, using Swineford’s computed bi-factor 
communalities as initial estimates, and then repeated the centroid 
analysis six times. There was some doubt concerning the possible 
significance of a fourth factor, but a four-factor principal-axes solu- 
tion, repeated once, was rejected because it gave one communality of 
[value illegible] for a test whose reliability as estimated by Swineford was only
[value illegible].

9. Morrison (1967, pp. 243, 273-5) reports intercorrelations among
the lengths of six chicken bones from a study by Wright (1954), and
gives a maximum-likelihood factor analysis to two factors. The two-
factor hypothesis was rejected by the Lawley test at the .001 level,
so the writer extracted a third factor by the centroid method, using
column means after reflection as residual communality estimates. The
highest third-factor loading was .206, all others were below .170, and
the added total communality was .092, so it seemed clear that a
fourth factor must be insignificant.

10. Fruchter (1954, pp. 72-85) reports intercorrelations among
eleven Air Force tests, and a factor analysis by the centroid method,
using Thurstone's procedure of estimating communalities by highest
r's and re-estimating by the same method in every residual matrix.
Using several of the older approximate methods for determining the 
number of factors, he concluded that five were significant. Since he 
had only eleven variables, the writer suspected that he might have 
over-factored this matrix because of the highest-r re-estimates in the 
residual matrices. The writer therefore re-factored his matrix by the 
principal-axes method, taking as initial communality estimates val- 
ues proportional to the squared multiple correlations but with sum 
equal to the sum of the highest r's. A residual communality was re- 
placed only if it was lower than one-half the mean absolute value of 
the off-diagonal entries in the column, in which case this latter value 
was used. It appeared that three factors were probably sufficient. The 
sums of squares of the factor loadings (eigenvalues for the principal
axes solution) were as follows:

Factor                        1      2      3     4     5     6            7
Eigenvalue (principal axes)   3.039  1.328  .666  .180  .132
Sum of squares (centroid)     3.018  1.293  .741  .248  .220  [illegible]  .074

For the principal axes solution, the fourth eigenvalue is lower than
the fifth sum of squares of the centroid solution, and the fourth
principal-axes factor had only two loadings above .20, neither of them
as high as .22. With three principal-axes factors, moreover, the largest
discrepancy between an estimated communality and the final computed
communality was less than .03. Communalities based on these three
factors were therefore used in the comparisons.

Due to very large N (8,158), more than three factors would cer-
tainly have been statistically significant in Fruchter's data, and we
may well have been unduly conservative in stopping at three. On
the other hand, a factor of quite doubtful significance was added to
the chicken bone data and another not much more significant to the
Lawley-Thomson data, so perhaps for the purpose of
comparison we are justified in being conservative as to the number of
factors in one of the ten samples.

The comparisons are shown in Table 1. The first three lines show
the number of subjects N, the number of variables n, and the number
of factors m, for each of the ten studies. We designate a final com-
puted communality as h², an estimated communality by (1) as C,
and a highest-r estimated communality as r.

The fourth row of Table 1 shows Σh²/Σr for each sample. The
mean of the ten values is .962, and the standard error of this
mean is .007. We conclude that for small matrices, the total commu-
nality is roughly 96 per cent of the sum of the highest r's; hence the
factor .96 in (1).

The next three rows show the within-sample ranges of the h²'s, the
C's, and the r's. For all samples but one, the range of the h²'s is great-
est, the range of the C's is next, and the range of the r's is least. In the
one exception, underlined, the range of the C's is greater than the
range of the h²'s.

The next two rows show the sums of absolute values of the dis-
crepancies between h² and C, and between h² and r. In all cases
except the one underlined, the sum of absolute values of the dis-
crepancies is smaller for the C's than for the r's. By the one-sided
sign test, this result is significant at almost exactly .01.

The last two rows show for each sample the one largest dis-
crepancy between an h² and a C, and the one largest discrepancy
between an h² and an r. In seven of the ten samples, the largest C-
discrepancy is smaller than the largest r-discrepancy. In the other
three cases, the C-discrepancy is underlined. While this result is far
from conclusive for ten samples, it is at least indicative (signifi-
cant at the .17 level by the one-sided sign test).
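
The two sign-test levels quoted above follow from the binomial distribution with probability one-half. As a small check, using only the Python standard library (the function name `one_sided_sign_test` is a hypothetical label for this sketch):

```python
from math import comb

def one_sided_sign_test(successes, n):
    """P(at least `successes` of n signs favor one method) under a fair coin."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# Nine of ten samples favor C on the sum of absolute discrepancies
p_sum = one_sided_sign_test(9, 10)   # 11/1024, about .011

# Seven of ten samples favor C on the largest discrepancy
p_max = one_sided_sign_test(7, 10)   # 176/1024, about .172
```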

We conclude that the Burt-Cattell recommendation has been re- 
duced to а fairly serviceable fixed procedure, and that (1) is to be 
preferred to the highest-r method for initial communality estima- 
tion. As compared to the latter, the use of (1) reduces the mean 
absolute value of the discrepancies between estimated and final com- 
puted communalities, and does not in general increase the size of 
the largest discrepancy. 


Re-estimation 
It is highly probable that Thurstone's caution was directed pri-
marily toward re-estimation by highest r's rather than to initial esti-
mation by this method.

If we start a factor analysis of fallible data, knowing in advance
the number of significant factors, and with communalities which are
exactly correct for that number of factors (as, e.g., by re-factoring
until the communalities are completely stabilized), and if we now
factor with no revision of diagonal residuals, the following effects
will be observed:

1. In each residual matrix, the diagonal elements will in general
be smaller, relative to the off-diagonal elements, than in the previous
matrix.

2. In the residual matrix after the last factor, the diagonal ele-
ments will all be exactly zero (within rounding error).

In each successive matrix, the real common variance is reduced,
but the error variance is merely re-shuffled. The residual matrix after
the last factor should have diagonal entries comparable in absolute 
value to the side entries if it is in fact an error matrix. About half of 
them should be positive and about half negative, with mean closer 
to zero than the individual entries. Re-factoring to complete sta- 
bility represents over-fitting of the factors, with all error variance 
forced into the side entries in order to permit the final diagonal 
entries to be all exactly zero. This procedure tends to inflate the 
range of the communalities, leading occasionally to an artificial 
Heywood case, and even more often to a quasi-Heywood case 
with one communality larger than the corresponding reliability. 

In the correlation matrix, the initial communalities are comparable 
in magnitude to the highest r's. In the last residual matrix from which 
a factor is extracted, they should be comparable in magnitude to 
the means of the entries in the columns, but must all remain positive. 
And in the residual matrix after the last factor, about half of them 
should be negative. 

Since small matrices seldom have more than four or five factors, 
the following three rules will probably be sufficient: 

1. Make initial communality estimates by (1). 

2. For factors other than the first factor and the last factor retained,
consider each diagonal residual, the absolute value of the highest
residual correlation in each column, and each column mean, exclu-
sive of the diagonal entry, after reflection. Choose the median of
these three as the estimate of each residual communality.

3. For the last factor retained, estimate all residual communalities
as the column means, exclusive of diagonal entries, after reflection.

Rule 2 says in effect to use the diagonal residual unless either the
highest r is lower or the column mean is higher. If the highest r is
lower, the original estimate by (1) was probably too high, and if
the column mean is higher, it was probably too low.
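
Rules 2 and 3 amount to a median-of-three choice and a column mean, respectively. The following is a rough sketch, assuming NumPy and assuming that the residual matrix passed in has already been reflected; the function name `residual_communalities` and the 3 × 3 example are hypothetical.

```python
import numpy as np

def residual_communalities(residual, last_factor=False):
    """Re-estimated diagonal entries for a residual matrix (already reflected)."""
    res = np.asarray(residual, dtype=float)
    n = res.shape[0]
    diagonal = np.diag(res).copy()            # current diagonal residuals
    off = res.copy()
    np.fill_diagonal(off, 0.0)                # exclude the diagonal entries
    col_mean = off.sum(axis=0) / (n - 1)      # column means of off-diagonal entries
    if last_factor:
        return col_mean                       # Rule 3
    highest = np.abs(off).max(axis=0)         # highest |residual r| in each column
    # Rule 2: the median of the three candidates, column by column
    return np.median(np.vstack([diagonal, highest, col_mean]), axis=0)

# Hypothetical reflected residual matrix
residual = np.array([[0.10, 0.20, 0.05],
                     [0.20, 0.30, 0.10],
                     [0.05, 0.10, 0.02]])

rule2 = residual_communalities(residual)
rule3 = residual_communalities(residual, last_factor=True)
```

In this example the first diagonal entry (.10) lies below its column mean and is raised to it, the second (.30) exceeds the highest r in its column and is lowered to it, and the third (.02) is raised to its column mean.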

Rule 3 recognizes the point that by the time the last factor to be 
retained is reached the highest r’s will be too high, and the diagonal 
residuals will consist mainly of the errors of estimation in the origi- 
nal communality estimates plus the errors of correction by Rule 2. 

While these rules are rough, they do provide for relatively decreas- 
ing diagonal residuals in successive matrices, and permit some nega- 
tive diagonal entries in the residual matrix after the last factor re- 
tained. 


On the Number of Factors 


The problem of the number of factors to be retained is usually less 
troublesome with small matrices than with larger ones, and the choice 
is commonly reduced to a choice between m and m + 1. No single 
rule is sufficient, but a judgment based on several will usually be 
correct. 

1. A factor is likely to be significant if it has one loading as high
as .30 and at least one other as high as .20, or if as many as one-
fourth of its loadings are as high as .20, or if one-fifth are as high as
.20 with at least one as high as .25.

2. A factor is likely not to be significant if the highest and the
second highest loading of every variable are on preceding factors.

3. When the last factor is reached, the sum of squares of all load-
ings will usually reach 95 per cent of the initial trace (the sum of the
estimated communalities). If this sum exceeds 105 per cent of the
initial trace, the factor is probably not significant.

4. If N is at least 100 but less than 800, the Burt standard error
of a factor loading (Burt and Banks, 1947) may be of some value.
This test may be simplified by incorporating into it the hypothesis
that the true loading is zero, yielding

$$\sigma_a = \sqrt{n/[N(n - m + 1)]}, \tag{3}$$

where the m-th factor is the one whose significance is to be esti-
mated. The factor will usually be significant if more than one-fourth
of its loadings exceed $2\sigma_a$, or if at least one-fifth exceed $2\sigma_a$, with one
larger than $3\sigma_a$. If N is less than 100, a factor may be significant
even when the Burt test says it is not, and when N exceeds 800, a
factor may not be significant even when the Burt test says it is. In
using the Burt test, estimate all diagonal residuals by Rule 3. The
Burt test is about as good as any of the other approximate tests,
and is easier to apply than most of them.
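
The simplified Burt test of rule 4 can be sketched as follows, using only the Python standard library. The sketch assumes the standard error is the square root of n/[N(n − m + 1)] and that the thresholds are two and three standard errors; the function names and the example loadings are hypothetical.

```python
from math import sqrt

def burt_sigma(n, N, m):
    """Standard error of a loading on the m-th factor, true loading taken as zero."""
    return sqrt(n / (N * (n - m + 1)))

def burt_likely_significant(loadings, N, m):
    """Rule 4: more than one-fourth of the loadings beyond two standard errors,
    or at least one-fifth beyond two with one beyond three."""
    n = len(loadings)
    sigma = burt_sigma(n, N, m)
    magnitudes = [abs(a) for a in loadings]
    share_over_2 = sum(a > 2 * sigma for a in magnitudes) / n
    return share_over_2 > 0.25 or (share_over_2 >= 0.20 and max(magnitudes) > 3 * sigma)

# Hypothetical third-factor loadings for nine variables and N = 100
weak = [0.30, 0.25, 0.05, 0.10, -0.08, 0.02, 0.12, -0.04, 0.06]
strong = [0.36, 0.25, 0.05, 0.10, -0.08, 0.02, 0.12, -0.04, 0.06]

sigma = burt_sigma(9, 100, 3)                       # about .113
weak_result = burt_likely_significant(weak, 100, 3)
strong_result = burt_likely_significant(strong, 100, 3)
```

Here two of the nine loadings exceed two standard errors in both cases, so the outcome turns on whether the largest loading also exceeds three standard errors (about .34).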

These tests are all, or almost all, inconsistent with one another.
Nevertheless, if all of them are considered, the investigator can
usually arrive at a correct conclusion.


REPORTED correlations among variables sampled from interest,
personality, and ability measures, although often statistically sig-
nificant, are usually too low to be of practical predictive value. The
question of why the common variance among these three domains
has been so low is intriguing. It was the thesis of this writer that the
adoption of a hierarchical trait structure as a conceptual model
would lend possible clarification to this question. Such a model
postulates the existence of “higher-order” traits which account for,
or explain, the covariation among traits of a “lower” level. Such
higher-order constructs can conceivably have greater generalizabil-
ity (Coan, 1964), and greater explanatory power (Royce, 1963)
than the constructs at the next lower level. One purpose of this study
was to explore the possibility of the existence of such a hierarchical
trait structure through the use of a higher-order factor analytic
procedure.

In this procedure the common variance among domains at the 
Variable level can be accounted for in two ways. The first is through 
the composition of the first-order factors, e.g. individual first-order 
factors would be highly loaded on variables from separate domains.
However, the findings of Bendig and Meyer (1963) suggest that
interdomain relationships will not be accounted for in this way, if 
the degree of relationship among variables from different psycho- 
logical domains is less than that which exists among variables from 
within the same domain. The findings of Bendig and Meyer, Ander- 
son and Anker (1964), and Becker (1963) would suggest that this 
pattern exists among the intercorrelations of interest, personality, 
and ability variables. 

Implicit in Bendig and Meyer’s rationale is the notion that the 
interdomain variance at the variable level will instead be accounted 
for through the existence of correlated first-order factors. This is 
the second way in which the relationships between domains can be 
accounted for. 

If the Bendig and Meyer thesis is correct (that a factor analysis of 
intercorrelations among variables from separate domains will yield 
correlated first-order factors, each being defined by a
single domain), this would suggest a possible answer to the ques- 
tion posed above. The reason for the low intercorrelations among 
variables from these separate domains may be a result of dealing 
with constructs which are not of a “high” enough order. It would 
suggest the existence of constructs, specific to the three domains, 
that have greater generalizability and explanatory power than the 
constructs we are now measuring. If the correlations between such 
factors were of sufficient magnitude, it could possibly provide 
more satisfying evidence regarding hypothesized relationships among 
these three domains. Also, it would then be possible to obtain 
second-order factors which could be representative of a level of con-
struct that would serve to further integrate the domains in ques-
tion.

The Strong Vocational Interest Blank-Form M (SVIB) and the
Minnesota Multiphasic Personality Inventory (MMPI) were se-
lected to represent the areas of vocational interests and personality
because of their extensive use in counseling and research. In selecting
these instruments on this basis it was recognized that their use in a
factor analysis was somewhat questionable. As Guilford (1957)
pointed out, the interdependent nature of the scales in these two in-
struments results in partially “built-in” correlations among their re-
spective scales. This built-in common variance among scales may, or
may not, be an accurate reflection of the degree of relationship among
the variables in question for a given sample. Since the validity of this
built-in variance must be held with some doubt for a given sample,
any factor structure of these instruments must also be viewed with 
some degree of skepticism. 

To what extent should one be concerned with this problem? This 
would be a function of the extent to which the observed correlations 
among the scales are a function of the empirical interdependence 
among the scales. If the built-in variance is extensive then one 
would, of course, be more concerned than if it is minimal. Therefore, 
a secondary purpose of this study was to determine the extent of the
built-in correlations among the scales for both the MMPI and SVIB.

Havlicek (1965) and Shure and Rogers (1965) have pointed out
that another consequence of these built-in relationships may be a
distorted and erroneous picture of factor stability. In order to
investigate this question it was decided to determine the built-in
factor structure of the MMPI and the SVIB and compare this
structure with a structure based on a sample of subjects. If there
were a good deal of similarity between the built-in factor structure
and the sample structure, the hypothesis of an erroneous notion of
factor stability would become more plausible. If there were little
similarity between the two structures, it would lead to the inference
that the independent portions of the scales are of greater influence
in determining the factor structures of the instruments and the no-
tion of a built-in factor stability would become less plausible.


Analysis of Sample Data 
Subjects 


The Ss of this study were those applicants for the Summerfield
Scholarship at the University of Kansas who were chosen as final-
ists for the academic years 1960-1961 (N = 46); 1961-1962 (N =
[illegible]); 1962-1963 (N = 46); 1963-1964 (N = 40); 1964-1965 (N =
[illegible]); 1965-1966 (N = 34); 1966-1967 (N = 83); 1967-1968 (N =
[illegible]). The Summerfield Scholarship is the highest academic honor
that is awarded to a male undergraduate by the University of
Kansas. These 400 candidates constituted the total number of
Summerfield finalists, from the academic year of 1960-1961 through
the academic year of 1967-1968, on whom the necessary data for
this study were complete.


Instruments

The instruments which were used in this study were 45 scales of
the SVIB-M, the three validity and 10 clinical scales of the MMPI,
Terman's Concept Mastery Test (CMT), and the Stouffer Mathematics
Test (SMT). This latter test is a locally devised instrument
which is used in the selection procedures for determining the
recipients of Summerfield Scholarship awards. The SVIB and MMPI
data were in standard score form, and the CMT and SMT data
were in raw score form.


Procedure


Prior to the factor analytic procedures, the distribution for each
variable was tested for normality; a test for nonlinearity of regression
was performed for every possible variable pair (Guilford, 1965,
pp. 308-316), and Bartlett’s test (Bartlett, 1950) was used to evaluate
the significance of the study's original 60 x 60 intercorrelation
matrix.

The factor analytic procedure for extracting higher-order factors
consisted of the following:

a. First, Kaiser and Caffrey’s (1965) alpha factor analysis was performed
on the original 60 x 60 intercorrelation matrix.

b. This was followed by the orthogonal rotation of these alpha 
loadings to an approximate simple structure position using Kaiser's 
normal varimax criterion. 

c. These rotated varimax loadings were then further rotated to a
maximum simple structure position using Eber's (1966) Maxplane
oblique rotational solution.

d. Three criteria, as suggested by Cattell (1966, pp. 188-189),
were used to judge the adequacy of the final solution, including
Bargmann’s (1953) significance test for simple structure.

e. The resulting correlations between factors were then evaluated 
to determine whether they were significantly different from zero, 
using Bartlett’s test of significance. 

f. If the correlations proved to be significant, an alpha factor analysis 
was performed in order to extract second-order factors. 


The same sequence of procedures, as listed above, was followed
to complete the second-order analysis.

After the completion of the second-order analysis, a Cattell-White
transformation (Cattell, 1966, p. 219) was performed to
obtain the loadings of these higher-order factors on the original set
of 60 variables.


Procedures for Determining the Extent of 
Built-in Correlation Due to Item Overlap and K-Correction 


In order to assess the extent of the built-in correlation due to item 
overlap, 2500 randomly answered MMPI’s and SVIB’s were scored. 
The probability of occurrence of each of the three response alternatives
(L, I, D) on the SVIB was set at .333. The probability of
occurrence of a true or false response on the MMPI was .5. Thus, each
response alternative on the SVIB, and on the MMPI, had an equal
probability of endorsement, i.e., any one of the unique response
patterns possible on both instruments (3 raised to the number of SVIB
items and 2 raised to the number of MMPI items) had
an equal probability of occurrence as any other pattern of responses.
The standard scores of these 2500 randomly answered inventories 
were then intercorrelated. The CMT and SMT were not in- 
cluded in this intercorrelation matrix because of their empirical in- 
dependence of all the other scales. 
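The logic of the random-answering procedure can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the two 20-item scales, their 10 shared items, and the function names are hypothetical and are not taken from the MMPI or SVIB scoring keys, all items are keyed in the same direction, and K-correction is not modeled. Raw scores are used, since a linear conversion to standard scores leaves the correlation unchanged.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_overlap(n_inventories=2500, n_items=30,
                     scale_a=range(0, 20), scale_b=range(10, 30), seed=1):
    """Correlate two hypothetical scales scored from randomly answered items.

    Items 10-19 belong to both scales, so any correlation between the
    scale scores is built in by item overlap rather than by the respondents.
    """
    rng = random.Random(seed)
    a_scores, b_scores = [], []
    for _ in range(n_inventories):
        # Each true/false item is endorsed with probability .5.
        answers = [rng.random() < 0.5 for _ in range(n_items)]
        a_scores.append(sum(answers[i] for i in scale_a))
        b_scores.append(sum(answers[i] for i in scale_b))
    return pearson(a_scores, b_scores)


r_builtin = simulate_overlap()
print(round(r_builtin, 2))
```

Under these assumptions the expected built-in correlation is the proportion of shared items, here 10/20 = .50, and the simulated value should fall close to that figure.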

It should be recognized that the method for determining the ex- 
tent of built-in correlation among scales is representative of an over- 
all “average” built-in effect. In order to calculate the built-in effect 
operating for a given sample, one would have to obtain the distribution
of endorsements for each response alternative for each item. The
probability of occurrence of each response alternative for a given
item would then be based on the distribution of endorsements for
that item rather than being set at an equal probability of occurrence.
Time did not permit the determination of these endorsement
distributions for the Summerfield sample.

In order to obtain that first-order factor structure which would be 
entirely based on built-in scale interdependency, the same procedures 
as outlined above were performed on the correlations derived from 
the 2500 randomly answered inventories. Ahmavaara's (1954) pro- 
cedure for comparing factor structures was used to determine how 
similar the built-in factor structure was to the factor structure based 
on the responses of 400 Summerfield finalists. 


² The results of the Cattell-White transformation and a third-order analysis
can be obtained from the author by request.

Results


Ten alpha factors having positive generalizability were extracted
from the “Summerfield” 60 x 60 intercorrelation matrix, accounting
for 76.70 per cent of the original interscale variance. Table 1 presents
the factor pattern matrix obtained from Maxplane. This solution
was based on finding that reference vector structure which yielded
a maximum 20 hyperplane (± .10) count. Out of 600 (60 x 10)
reference vector coefficients, 64.2 per cent were .10 or less and 74.7
per cent were .20 or less in absolute value. Table 2 presents the
intercorrelations among these 10 first-order alpha factors.
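The hyperplane count reported above is simply the percentage of reference vector coefficients whose absolute value does not exceed the chosen width. A minimal sketch follows; the function name and the 4 x 2 matrix are hypothetical and serve only to show the counting rule, not to reproduce the Maxplane program or the study's data.

```python
def hyperplane_percentage(loadings, width=0.10):
    """Percentage of coefficients with absolute value <= width."""
    flat = [abs(v) for row in loadings for v in row]
    inside = sum(1 for v in flat if v <= width)
    return 100.0 * inside / len(flat)


# Hypothetical 4 x 2 reference vector matrix, for illustration only.
example = [
    [0.72, 0.05],
    [0.65, -0.08],
    [0.03, 0.58],
    [-0.12, 0.49],
]

pct_10 = hyperplane_percentage(example, width=0.10)  # 3 of 8 coefficients
pct_20 = hyperplane_percentage(example, width=0.20)  # 4 of 8 coefficients
print(pct_10, pct_20)
```

Applied to the 600 coefficients of the study, the same rule would give the 64.2 and 74.7 per cent figures quoted for widths of .10 and .20.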

It is immediately evident from Table 1 that there are several in- 
stances in which a first-order factor is loaded on variables from differ- 
ent domains. However, on closer inspection of these loadings, there is 
a rather high degree of independence among the domains. In order 
to demonstrate this independence, the loadings above an absolute 
value of .40 have been italicized in Table 1. The figure is an
arbitrary one, but the investigator believes this cut-off point provides 
a clear and meaningful presentation of the pattern of loadings, and 
yet is not so high as to delete loadings of practical significance in 
inferring the identity of a factor. Using this frame of reference, 
first-order factors IV, V, and VII are defined by MMPI variables, 
and the seven remaining factors are defined by SVIB variables. 

It is also noticeable that the loadings on the two ability variables,
CMT and SMT, do not achieve any appreciable magnitude
(± .40) across all ten first-order factors, and therefore there is an
absence of what might be termed an “ability factor.” The communalities
for these two variables were only .2497 and .1183, respectively,
indicating that the major portion of their variance was specific
(rather than shared) variance. Thus, a large portion of the
variance of these two variables remained independent of the other
two domains and unaccounted for by the common factor structure.

In Table 2 it can be seen that the analysis did result in correlated
first-order factors with the exception of a few factor pairs which
approach orthogonality. However, of the three “personality factors”
(IV, V, and VII) only factor VII is substantially correlated with the
seven “interest factors.” Of the 10 first-order factors, factor VII
accounted for the least amount of variance. Factors IV and V are
essentially orthogonal to these seven factors.
Four alpha factors were extracted from the correlations among the
first-order factors. These four factors accounted for 56.85 per cent
of the total variance expressed by the original 60 x 60 intercorrelation
matrix. It should be noted that the final communality value for
first-order factor VII was 1.12194, resulting in a Heywood case
(Heywood, 1931). It was decided to allow this communality value 
to converge to a value greater than one on the basis of the ad- 
vocacy of Kaiser and Caffrey (1965, pp. 8-9) who present the ration- 
ale that this procedure “implies only that the variable’s unique part 
has negative variance—its unique factor scores are imaginary." 

Table 3 presents the factor pattern matrix derived from the Max- 
plane rotation of the four second-order alpha factors. The per- 
centage of variables (first-order factors) attained in the 20 hyper- 
plane for this solution was 57.50. 

Looking at the second-order factor pattern matrix as presented in
Table 3, it is evident that in terms of factor composition the two
areas of vocational interest and personality remain largely separate
and distinct, with the exception of second-order factor III. Second-order
factors I and II are entirely defined by first-order factors which
have been previously defined as interest factors. Second-order factor
IV is almost a complete reflection of first-order factor IV, a personality
factor. Only second-order factor III has substantial loadings on
first-order factors from both domains.

Space limitations made it necessary to condense the 58 x 58
matrix of intercorrelations based on 2500 randomly answered SVIB
and MMPI inventories. Table 4 presents the frequency distributions
of the built-in intercorrelations among the 45 scales of the SVIB.
The interval size of .04 represents two standard errors of measurement
(Guilford, 1965, p. 162).

Table 5 presents the intercorrelations among the K-corrected
standard scores for the 13 basic MMPI scales that were based on
2500 randomly answered inventories.

Of the 585 (45 x 13) between-instrument correlations, only
9 (1.5 per cent) exceeded zero by more than two standard errors of
measurement. The largest between-instrument correlation was .07.

Fourteen alpha factors were extracted from the 58 x 58 built-in
intercorrelation matrix, accounting for 51.66 per cent of the original
variance. Table 6 presents the factor pattern matrix obtained
from the Maxplane solution. The percentage of variables attained in
the 20 hyperplane was 76.35 per cent.


TABLE 2

Intercorrelations of Ten First-Order Alpha Factors from Eber’s Maxplane Oblique
Solution Based on a Factor Analysis of the CMT, SMT, MMPI and SVIB-M
for 400 Summerfield Finalists

[Table body not recoverable.]

As would be expected, because of "chance" between-instrument 
correlations, the composition of each “built-in factor” is restricted to 
variables from either the SVIB or the MMPI. On inspection, many 
of the interest factors are predominately defined by a particular 
occupational grouping, e.g. factor III is defined by Group V, Social 
Service Occupations. With respect to the MMPI factors, one can 
justify the groupings of variables on each of these factors in terms 
of the variables sharing a K-correction, the sharing of a large num- 
ber of items, or both (Dahlstrom and Welsh, 1960, p. 82). 


TABLE 3

Factor Pattern Matrix for Four Second-Order Factors from Eber's Maxplane
Oblique Solution Based on a Second-Order Analysis
of the CMT, SMT, MMPI and SVIB-M for 400
Summerfield Finalists

First-Order
Alpha Factor  I  II  III  IV
I     -.7152
II    -.5752  .8998
III    .4632  -.4127
IV     .8088
V      .4274
VI     .6405
VII   -1.1101
VIII  -.6441
IX     .6493  -.6337  -.5263
X     -.9045


Note. Loadings less than ±.40 have been deleted for clarity of presentation.
TABLE 4
Distribution of Correlations among Scales of the SVIB Based
on 2500 Randomly Answered Inventories

Interval      Frequency     Interval        Frequency
.77 to .80        1         -.05 to -.08       60
.73 to .76        1         -.09 to -.12       67
.69 to .72        2         -.13 to -.16       49
.65 to .68        1         -.17 to -.20       45
.61 to .64        6         -.21 to -.24       42
.57 to .60        8         -.25 to -.28       29
.53 to .56        4         -.29 to -.32        7
.49 to .52       13         -.33 to -.36       12
.45 to .48        9         -.37 to -.40        1
.41 to .44       26         -.41 to -.44        5
.37 to .40       21         -.45 to -.48        2
.33 to .36       28         -.49 to -.52        2
.29 to .32       37         -.53 to -.56        0
.25 to .28       45         -.57 to -.60        1
.21 to .24       46         -.61 to -.64        2
.17 to .20       59
.13 to .16       51
.09 to .12       65
.05 to .08       57

TOTAL           480         TOTAL             350

Note. [illegible]

Table 7 presents Ahmavaara's transformation, comparing the
built-in oblique first-order factors with the Summerfield oblique
first-order factors.

From Table 7 the reader will notice that there are several instances
where a built-in factor would be considered to be quite similar to
a sample factor. These results suggest the inference that the
factor structure of “pooled” MMPI and SVIB variables may be
largely built-in, in that several quite similar factors emerge even
when these two inventories are responded to in a completely random
manner.


TABLE 5
Intercorrelations Between Scales of the MMPI Based on 2500 Randomly Answered Inventories

[Table body not recoverable.]


TABLE 7

Ahmavaara's Transformation Comparing Built-in Factors with Sample Factors
Based on Oblique Factor Pattern Matrices

[Table body not recoverable.]

Note. The coefficients in this comparison matrix represent the loading of a built-in factor on a sample
factor, where the former has been transformed into the common factor space of the latter.


Conclusions 


The general conclusion of this study, disregarding the issue of in- 
terdependent scales, would be that the psychological realms of vo- 
cational interest (SVIB), personality (MMPI), and ability (CMT 
and SMT) have again been demonstrated to be, to a surprising 
extent, independent and unrelated. Each of the first-order factors 
was defined largely by a single domain, i.e., by either interest or
personality variables. The two ability variables proved to be not 
only independent of the other domains but also of each other. At 
the second-order level there was only one factor which loaded on 
both interest and personality first-order factors. The writer is of the 
opinion that the results of this study do not provide enough evidence 
to support the previously described hierarchical trait model with 
any substantial degree of confidence. 

When the possible implications of the effect of scale interdependency
are considered, one is left with uncertainties and questions
rather than conclusions. It would be erroneous to conclude that,
because of the demonstrated similarity between the sample structure
and the built-in structure, the sample structure is largely a reflection
of a methodological artifact rather than stable personality and
interest dimensions. The opposite conclusion could be as easily
reached. For example, in the case of the MMPI, overlapping items
were introduced because they discriminated between a normal group
and two (or more) maladjusted groups. Therefore, overlapping
items between two scales can be viewed as a possible measure of a
dimension which is common to the two maladjusted groups in question,
but which is not common to “normals.” Might not these psychological
dimensions be of a more fundamental and stable nature
than those dimensions which represent the differentiation of only a
single maladjusted group from normals? If one is to consider the
built-in relationships among scales as completely meaningless for a
given sample of subjects, then the assumption has to be made that
the ability of overlapping items to discriminate between “normal”
and “maladjusted” subjects is not valid for that particular sample
of subjects. Is there any reason to suspect an overlapping item to be
less valid across populations than a nonoverlapping item?

Although no definitive conclusions can be reached regarding the
validity of built-in relationships for a given sample, the degree to
which reported correlations among the scales of the MMPI and SVIB
may be erroneous indices of relationship should at least be recognized.
One might also speculate that the built-in relationships
would have been even greater, and the similarities between the
built-in structure and the sample structure more numerous, if the
built-in correlations had been calculated on the basis of the distribution
of endorsements given by the Summerfield sample, rather
than assuming an equal probability of endorsement for each item
alternative.
TYPING SHIPS WITH TRANSPOSE FACTOR ANALYSIS


WILSON H. GUERTIN
University of Florida


[...] recognizing the possibility of transposing the data matrix before
intercorrelating columns and factor analyzing. When the transposed
rows become columns, the intercorrelation of columns is an
intercorrelation of people.

[...] explicates clusters of these variables in test-space. When the
correlation is between people the analysis is of people-space instead
of test-space. Therefore, we can say that the transpose factor
analysis explicates clusters of these persons in people-space.

With superficial classification the correspondence between per- [...]

In such a system the classification of crook or noncrook can be
made from a knowledge of a person's one attribute, like honesty.
If dishonesty and intelligence enter the classification scheme as
dichotomous determinants there are four classes but only two trait
measures. With multivariate trait bases for classification the number
of type-factors and trait-factors will not necessarily be equal
nor will there be an obvious correspondence between two kinds of
factors.

Burt (1940) exaggerated the correspondence between the con- 
ventional and transpose factor analyses. He placed such conditions 
on the correlation matrices that his correspondence is obviously
mathematical and of theoretical rather than practical importance. 
The conditions are that: raw scores be normalized and ipsatized, 
cross-products be used instead of correlations, unities be placed 
in the diagonal, and finally, no rotation of the factor matrix be 
made. Such restrictions as these take the analysis far from the 
realm of common-factor analysis and the practical concerns of 
the data analyst. 

A comparison of the two sets of obtained factors is illustrated for 
geometric objects in a study by Lorr, Jenkins, and Medland (1955). 
Even with only four attribute (trait) factors they found only 
moderate correspondence between sets of factors. Cattell prob- 
ably has been over-reacting to Stephenson’s extravagant claims 
for transpose analysis (called Q-technique by Cattell as con- 
trasted with the conventional or R-technique) and Q-sorting. 
However, Cattell states now “. . . no simple equivalence of mean- 
ingful factors from ordinary R- and Q-technique results.” (1966, 
p. 228).

Cattell's reaction to Stephenson’s claims may well be part of the
basis for his continued ambivalence toward transpose factor analy- 
sis as a means of deriving taxonomic principles. Before Stephen- 
son’s book appeared (1953) Cattell wrote, “Q-technique is most 
useful if one wishes immediately to see how many types there 
are in a population and to divide it up into types.” (1952, p. 101). 
Cattell continues to employ parametric statistics in conventional 
analyses but reverts to nonparametric cluster analysis when seek- 
ing taxonomic principles in the interperson similarities matrix 
derived from distances in the transposed data matrix. He now goes 
so far as to deny that types are factors! (Cattell and Coulter, 
1966, p. 242). Much of Cattell’s disappointment may have come 
from using correlation earlier as the index of similarity between 
people. 

Cronbach and Gleser (1953) pointed out that the product- 
moment r fails to express differences in the average levels (means) 
as well as differences in average scatter (variance) of two pro- 
files. They recommend use of the distance statistic, d. This d is 
the square root of the sum of the squared differences between two 
profiles across all variables. 
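The point made by Cronbach and Gleser can be seen in a small numerical sketch. The three profiles and the function names below are hypothetical and chosen only for illustration: profile B has the same shape as A at a higher level, and profile C has the same shape with greater scatter. In both cases r is 1.00, while d registers the difference.

```python
import math


def pearson_r(p, q):
    """Product-moment correlation between two profiles."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    spq = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    spp = sum((a - mp) ** 2 for a in p)
    sqq = sum((b - mq) ** 2 for b in q)
    return spq / math.sqrt(spp * sqq)


def distance(p, q):
    """d: square root of the sum of squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical profiles on four variables.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [3.0, 4.0, 5.0, 6.0]   # same shape, level raised by 2
profile_c = [2.0, 4.0, 6.0, 8.0]   # same shape, scatter doubled

r_ab, d_ab = pearson_r(profile_a, profile_b), distance(profile_a, profile_b)
r_ac, d_ac = pearson_r(profile_a, profile_c), distance(profile_a, profile_c)
print(r_ab, d_ab, r_ac, d_ac)
```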

Factor analyses have been made of covariance matrices (Horst,
1965) and cross-product matrices (Nunnally, 1962) but these
produce very different results from those obtained by factor analyzing
interprofile d's. Use of d instead of covariance or cross-products
permits the extraction of common factors. The only d
factor analyses reported to date are in connection with Guertin’s
Successive Profile Analysis procedure (1966). It is time a clear-cut
illustration of typing with factor analysis of interprofile d’s appeared
in the literature.


Transpose Analysis of Ship Intercorrelations 


Cattell and Coulter provided the ship data they analyzed with 
their “taxonome” procedure (1966, p. 265). There were 12 mea- 
sures on each of 29 vessels from data in Jane’s Fighting Ships 
(1964-65). The object of the task is to find out how many types 
of ships there are and which ships belong to each class. 

All measurements were put in standard score form and are as
follows: displacement, length, beam, number of light, of medium,
of heavy, and of very heavy guns, number of personnel, maximum
speed, submersibility (obviously dichotomous), continuity
of deck construction, and number of planes carried.

If we intercorrelate all the 29 ships across the 12 measures, then
those ships which are alike because they are of the same type
should be highly intercorrelated. Inspection of the matrix of correlations
should disclose four clusters or submatrices of high
intercorrelations because, as we find out later, there are four types
of ships. Each cluster constitutes a group of ships that should be
alike (in a correlational sense) but bear little “similarity” to members
of other clusters. The factors produced can be viewed as
possible type categories for the classification job at hand.

Principal axes from the reduced intercorrelation matrix of the
29 ships were rotated to give the Varimax rotated factors which appear
in Table 1. Loadings greater than .49 are italicized for emphasis.
The five carriers are clearly identified as belonging together
in a class by themselves by having positive loadings on factor
I. The submarines are negatively related to the class and all have
negative loadings of at least .40. Thus, the first factor identifies
another type of ship at the negative end.

The second factor gives perfect identification of the destroyer
type and its members. Again submarines are loaded negatively.
Using the cutting line, .49, only submarine number four lies outside
the class with a loading of -.48.


TABLE 1
Varimax Rotated Matrix for Correlational Analysis of Ships Problem

                         Factor
Ship               I      II     III     IV
Carrier     1     .91    .10    -.18    -.29
            2     .85    .36    -.30    -.09
            3     .94    .09    -.11    -.23
            4     .96    .11    -.10    -.27
            5     .83    .33    -.28    -.19
Destroyer   1     .22    .94     .09    -.03
            2     .24    .98     .02    -.06
            3     .12    .96     .17    -.09
            4     .03    .96     .09     .00
Submarine   1    -.43   -.52    -.55     .42
            2    -.48   -.66    -.62     .41
            3    -.69   -.71    -.29    -.12
            4    -.66   -.48    -.22    -.60
            5    -.54   -.58    -.52    -.22
            6    -.64   -.56    -.28    -.28
            7    -.62   -.66    -.23    -.33
            8    -.40   -.50    -.26    -.22
            9    -.41   -.60    -.66    -.42
           10    -.66   -.68    -.40     .00
Frigate     1    -.14    .26     .82     .44
            2    -.14    .11     .93     .28
            3    -.11    .19     .86     .39
            4    -.15    .07     .92     .26
            5    -.18    .10     .60     .80
            6    -.16   -.20     .46     .80
            7    -.16   -.05     .17     .96
            8    -.17    .15     .18     .94
            9    -.07    .24     .94     .13
           10    -.15    .06     .36     .91

The third factor centers around the frigates but some class 
members load weakly. Nor is this a case of factor fission where 
variance from the frigate cluster is pulled out into another dimen- 
sion because too many factors were rotated. We can be sure of this 
because the R matrix (not reported here) shows frigate 4 nega- 
tively correlated with frigates 7 and 8. The fourth factor behaves 
much like the third in getting only half of the frigates to load 
heavily. 

The overall results stack up to indicate the two clear types of
carrier and destroyer with unequivocal identification of the members
of each. Submarines are less homogeneous but are recognizable
as a type or two subtypes. The frigates break into two distinct
types. The results of the analysis are by no means satisfactory
so we are led to consider other possibilities for analysis, namely,
factor analyzing d.


Transpose Analysis of Ship Distances 


The existing concepts for factor-space based upon a correlation
matrix are not appropriately named for dealing with factor-space
derived from a distance-based index of similarity. A distinction
must be made between test-space (Thurstone's term which in
more general form would be attribute-space) and person-space.
On the other hand a distinction must be made between both test-
or person-space based upon relation indices and that from distance
indices. When space is used as the analogue of relationship
the term relation-space is appropriate. When space is the analogue
for representing distances between people or tests the term
distance-space is appropriate.


TABLE 2
Varimax Rotated Matrix for Distance Analysis of Ships Problem

                         Factor
Ship               I      II     III     IV
Carrier     1     .78    .27     .26     .24
            2     .82    .23    -.02     .00
            3     .89    .11     .01     .03
            4     .89    .20     .14     .14
            5     .72    .38     .27     .25
Destroyer   1     .39    .80     .12     .24
            2     .39    .81     .19     .29
            3     .32    .78     .23     .37
            4     .27    .82     .15     .30
Submarine   1     .17    .15     .91     .22
            2     .15    .14     .91     .24
            3     .10    .10     .84     .36
            4     .15    .18     .88     .25
            5     .07    .08     .89     .22
            6     .01    .09     .86     .23
            7     .05    .11     .85     .30
            8     .10    .12     .70     .45
     [9 or 10]    .17    .15     .89     .20
Frigate          [rows not recoverable]

The statistic d is a distance-based measure of dissimilarity so 
it is necessary to transform it to a similarity index. The d’s can be 
changed to similarity indices by simply reflecting them, i.e. sub- 
tract them from a constant. The arbitrary constant from which 
each is subtracted is not critical: we use the largest d in the off- 
diagonal elements. 

Subtracting a distance measure from the largest value present 
is to employ a very arbitrary reference value. The largest d value 
will depend upon how far apart the two most dissimilar profiles 
are. While we cannot eliminate the noncomparability of the 
magnitude of d between studies, we can eliminate the effect from 
the interprofile similarity matrix. This will be done for the 
examples used here by dividing each element of the reflected ma- 
trix by the largest element in it. Thus, the magnitudes of the index 
range from 0 to 1, and the dependence of the index on the arbitrary 
magnitude of score scales is eliminated. The result is a new index 
and we refer to it as the distance similarity index or DSI. 
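The two steps just described, reflection about the largest d and division by the largest reflected value, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the three two-variable profiles are hypothetical, and the diagonal is left empty (None) because communality estimates are inserted there before factoring.

```python
import math


def dsi_matrix(profiles):
    """Distance similarity index (DSI) for every pair of profiles."""
    n = len(profiles)
    d = [[math.sqrt(sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j])))
          for j in range(n)] for i in range(n)]
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    big = max(d[i][j] for i, j in off_diag)            # largest d
    reflected = {(i, j): big - d[i][j] for i, j in off_diag}
    big_prime = max(reflected.values())                # largest reflected value
    if big_prime == 0:
        raise ValueError("all interprofile distances are equal")
    # Diagonal cells are left as None; communality estimates go there.
    return [[None if i == j else reflected[(i, j)] / big_prime
             for j in range(n)] for i in range(n)]


# Hypothetical profiles on two variables.
profiles = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
dsi = dsi_matrix(profiles)
print(dsi)
```

In this toy case the most dissimilar pair (profiles 1 and 2 of the list, d = 5) receives an index of 0 and the most similar pair (d = 1) receives an index of 1, which is the range described in the text.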

The concept of common person-distance factor-space appears 
sound. Person-distance communality can be estimated by taking 
the largest column index value and factoring iteratively to get a 
final value, thus supporting the analogy between the two types 
of factor-space. The correlation index of interprofile similarity 
gave type-factors with uncertain correspondence to actual classes 
of vessels. What would a factor analysis of interprofile d’s show? 
Table 2 gives the Varimax rotated matrix for the analysis. The
communality estimate employed was the largest value in the column.
It has been demonstrated (Guertin, 1969) that such estimates are
satisfactory.

This time the rotated matrix unequivocally indicates four classes 
of vessels. The italicized “loadings” are those above .49. None of 
these appreciable loadings is misplaced and none needed to indicate 
true class membership is missing. 


Illustration of Computations 
To remove any ambiguity about computational procedure some 
of the calculations will be presented here. It will suffice to work 
with the upper left submatrix (first three profiles). 


TABLE 3
First Three Profile Submatrices

Standard scores
Variable      Ship 1      Ship 2      Ship 3
 1           1.50330     1.84090     1.81500
 2           1.12770     2.82440     2.88780
 3           1.45080     2.31550     2.31550
 4            .84196     1.88100     1.88100
 5           1.77480     2.07140     2.56170
 6           -.09409     2.06010      .04952
 7           -.50175      .72136      .72136
 8           -.63840     -.63840     -.63840
 9           1.22420     2.89220     2.89220
10            .53561     1.16940     1.16940
11           -.72548     -.72548     -.72548
12           -.63830     -.63830     -.63830

Sum d²
Ship        1          2          3
1                   14.2280    10.3423
2                               4.2875
3

Square Root Sum d²
Ship        1          2          3
1                    3.7721     3.2159
2                               2.0706
3

BIG - d
Ship        1          2          3
1                    5.6047     6.1609
2                               7.3062
3

(BIG - d)/BIG' or DSI
Ship        1          2          3
1         .8781      .6118      .6725
2                    .7975      .7975
3                               .8353

Table 3 gives the successive submatrices from standard scores
to the final similarity index values. The sum d² values are derived
from the sum of the squared differences between pairs of scores
for ships on all 12 variables. The next step is to take the square
root. The index for 1-2 is 3.7721 and is the largest in the submatrix,
but the largest value in the complete matrix (not shown here) is
that between 2 and 19. It is labelled “BIG” and is equal to
9.3768. Computations have been carried to 10 decimal place accuracy
but rounded to four places for presentation in the tables.

Next, all d’s are subtracted from BIG. While 2-3 is the largest in
the resulting submatrix with a value of 7.3062, that in the complete
matrix (not shown here) is 9.1610 for ships 20-22. BIG’ =
9.1610 is used to divide each element so these final similarity index
values range from 0.00 to 1.00. It is this last matrix with communality
estimates inserted in the diagonal which is factor analyzed.
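The computations for the first three ships can be checked with a short script. The sketch assumes the five-place standard scores of Table 3 (with variable 7 for ship 3 taken as .72136, the value consistent with the tabled sums) and takes BIG = 9.3768 and BIG' = 9.1610 from the text, since the complete 29 x 29 matrix is not shown. The variable and function names are illustrative only.

```python
import math

# Standard scores of the first three ships on the 12 variables (Table 3).
ships = {
    1: [1.50330, 1.12770, 1.45080, 0.84196, 1.77480, -0.09409,
        -0.50175, -0.63840, 1.22420, 0.53561, -0.72548, -0.63830],
    2: [1.84090, 2.82440, 2.31550, 1.88100, 2.07140, 2.06010,
        0.72136, -0.63840, 2.89220, 1.16940, -0.72548, -0.63830],
    3: [1.81500, 2.88780, 2.31550, 1.88100, 2.56170, 0.04952,
        0.72136, -0.63840, 2.89220, 1.16940, -0.72548, -0.63830],
}

BIG = 9.3768        # largest d in the complete matrix (from the text)
BIG_PRIME = 9.1610  # largest reflected value in the complete matrix (from the text)


def d(i, j):
    """Square root of the sum of squared score differences for ships i and j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ships[i], ships[j])))


def dsi(i, j):
    """Distance similarity index: reflect d about BIG, then divide by BIG'."""
    return (BIG - d(i, j)) / BIG_PRIME


for pair in [(1, 2), (1, 3), (2, 3)]:
    print(pair, round(d(*pair), 4), round(dsi(*pair), 4))
```

With rounded input scores the results should agree with the tabled distances and DSI values to about four decimal places.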


Summary

The value of distance indices for profile matching is discussed
in the literature occasionally and demonstrated even less frequently.
Similarly, transpose factor analysis has been the subject
of hot debate but little action. The value of these procedures in
identifying classes for the assignment of individuals with recurring
similar profiles by analyzing their profiles needs to be illustrated.

Cattell supplied 12 measures on each of 29 fighting ships. Each
ship was known to be a carrier, destroyer, submarine, or frigate.
The intercorrelation matrix of the transposed scores, with multiple
correlation estimates of communality in the diagonal, was factored
and rotated to a Varimax solution. It failed to clearly identify the
number of classes and which ships belonged in each.

Distance measures between profiles were computed, reflected 
to represent similarities, and then divided by the maximum value 
in the interprofile matrix of similarities. The values in this final 
matrix which was factor analyzed must necessarily range from 
0.00 to 1.00. It is proposed that this index based upon d be called
the distance similarity index and abbreviated as DSI. 

The transpose factor analysis of the DSI interprofile matrix was 
much more successful than the transpose factor analysis of the 
intercorrelations of profiles. In fact, the DSI analysis produced 
the four type-factors corresponding to the four ship-classes. 
Furthermore, each ship was unequivocally classified correctly by 
taking the highest of the four type-loadings as the basis. 
CONSIDERATIONS WHEN MAKING INFERENCES
WITHIN THE ANALYSIS OF COVARIANCE MODEL¹


CHARLES E. WERTS AND ROBERT L. LINN 


In his discussion of multiple regression, Cohen (1968) notes
that the analysis of covariance (ANCOVA) is equivalent to a
regression analysis in which treatment groups are coded as dummy
variables. The numerator of the traditional ANCOVA F test
(McNemar, 1962) is equivalent to the squared total multiple
correlation of dummy variables and covariates with the dependent
variable minus the squared multiple correlation of the covariates
with the dependent variable, and the denominator of the F test
is equivalent to the error variance, i.e., one minus the squared
total multiple correlation. The numerator is therefore what
Darlington (1968) calls “usefulness,” i.e., the proportion of variance
that the dummy variables add to the prediction of the dependent
variable in a stepwise regression after the covariates have entered.
However, Linn and Werts (1969) point out that “usefulness”
requires a number of assumptions when used in making causal
inferences. It follows that the considerations listed by Linn and
Werts (1969) can be specialized to the analysis of covariance
method, as shall be demonstrated below.
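
The equivalence stated above can be checked numerically. The following is a minimal sketch on simulated data, not part of the original analysis: the data, the seed, and all variable names are assumptions made for illustration. It computes the F ratio once from the increment in squared multiple correlation and once from the classical adjusted sums of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_per = 3, 20                                  # hypothetical design: 3 groups of 20
group = np.repeat(np.arange(k), n_per)
n = group.size
x = rng.normal(size=n) + 0.5 * group              # simulated covariate
y = np.array([1.0, 2.0, 3.5])[group] + 0.8 * x + rng.normal(size=n)

def r_squared(design, target):
    """Squared multiple correlation of target with the columns of design."""
    coef = np.linalg.lstsq(design, target, rcond=None)[0]
    resid = target - design @ coef
    dev = target - target.mean()
    return 1.0 - (resid @ resid) / (dev @ dev)

ones = np.ones(n)
z2 = (group == 1).astype(float)                   # dummy variable for group 2
z3 = (group == 2).astype(float)                   # dummy variable for group 3
r2_cov = r_squared(np.column_stack([ones, x]), y)
r2_full = r_squared(np.column_stack([ones, z2, z3, x]), y)
usefulness = r2_full - r2_cov
f_from_r2 = (usefulness / (k - 1)) / ((1.0 - r2_full) / (n - k - 1))

def cross(u, v, within):
    """Total or pooled within-group sum of cross products."""
    if within:
        mu = np.array([u[group == j].mean() for j in range(k)])[group]
        mv = np.array([v[group == j].mean() for j in range(k)])[group]
    else:
        mu, mv = u.mean(), v.mean()
    return ((u - mu) * (v - mv)).sum()

adj_total = cross(y, y, False) - cross(x, y, False) ** 2 / cross(x, x, False)
adj_error = cross(y, y, True) - cross(x, y, True) ** 2 / cross(x, x, True)
f_classical = ((adj_total - adj_error) / (k - 1)) / (adj_error / (n - k - 1))
```

Both routes give the same F ratio because the total sum of squares of Y cancels from the numerator and the denominator.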


The ANCOVA Model 
For purposes of discussion consider the case of a single
covariate (X), three treatment groups (j = 1, 2, 3), and a dependent
variable, Y. The mathematical model for ANCOVA is then:


$Y_{ij} = A_j + B_w X_{ij} + e_{ij}$ (1)


where

$A_j$ = the Y intercept of the Y on X regression line for group j.
Also $A_j = \bar{Y}_j - B_w \bar{X}_j$ ($\bar{X}_j$ and $\bar{Y}_j$ are the respective means of X
and Y for group j),

$B_w$ = pooled within group regression slope, and

$e_{ij}$ = error term assumed to be independent of $A_j$ and $X_{ij}$,
and with zero mean.


¹ The research reported herein was performed pursuant to Grant No. OEG-... U. S. De...


In addition to the usual linear regression assumptions, this model
requires homogeneity of within-group regression. The traditional
ANCOVA procedure also requires that the treatment effect (i.e., $A_j$)
and the covariate ($X_{ij}$) be independent (Evans and Anastasio, 1968).
Furthermore, it must follow from substantive theoretical
considerations that the treatment means should be adjusted (Smith, 1957)
using the within-group regression slope (i.e., $A_j = \bar{Y}_j - B_w \bar{X}_j$).
Providing reasonable justification for the use of ANCOVA in
nonexperimental situations may be quite difficult, as Lord (1969) clearly
demonstrates. Because of this it is frequently necessary to abandon
strong causal interpretation and to develop insights that establish
models in which causal inferences can be tested later.

Following Cohen’s (1968) procedure, equation (1) can be
translated for computational purposes to the dummy variable form:


$Y_{ij} = B_1 Z_1 + B_2 Z_2 + B_3 Z_3 + B_w X_{ij} + e_{ij}$ (2)

where the dummy variable coding is:

$Z_1$ = 1 for everybody,

$Z_2$ = 1 for persons in group #2, 0 for others,

$Z_3$ = 1 for persons in group #3, 0 for others.


When equation (2) is solved in the usual regression program the
$A_j$ intercepts in equation (1) can be computed directly from the
regression weights in equation (2) as shown by Johnston (1960), i.e.,
$A_1 = B_1$, $A_2 = B_1 + B_2$, and $A_3 = B_1 + B_3$. Since Cohen's
procedure involves the computation of correlations, it is useful to
convert equation (2) to its standardized form:


$y_{ij} = b_2^* z_2 + b_3^* z_3 + b_w^* x_{ij} + e_{ij}$, (3)

where lower case letters indicate standardized variables and $b_i^*$ are
the standardized regression weights. $b_1^*$ is zero since $Z_1$ in equation
(2) is a constant.

The “usefulness” of the dummy variables in terms of the total
multiple correlation $R_{y \cdot z_2 z_3 x}$ is then²


$R^2_{y(z_2 \cdot x,\, z_3 \cdot x)}$ = "usefulness" = $R^2_{y \cdot z_2 z_3 x} - r^2_{yx}$


where $R_{y(z_2 \cdot x,\, z_3 \cdot x)}$ is the multiple part correlation of the dummy
variables with the dependent variable, covariate partialed out of the
dummy variables. The subscript $z_2 \cdot x$ indicates the residual of $z_2$ with
x controlled and $z_3 \cdot x$ is the residual for $z_3$ with x controlled.
As should be suspected from a comparison of equations (1), (2),
and (3) the multiple part correlation $R_{y(z_2 \cdot x,\, z_3 \cdot x)}$ is equal to the part
correlation $R_{(A \cdot X)Y}$, i.e., the correlation of the dependent variable
with the intercepts when the covariate is partialed out of the
intercepts. In other words, we could have taken the intercepts calculated
in the dummy variable analysis, inserted these in equation (1), and
computed the "usefulness" or "adjusted treatment variance" from
the correlations among $X_{ij}$, $Y_{ij}$, and $A_j$:


$R^2_{(A \cdot X)Y} = R^2_{Y \cdot AX} - R^2_{YX} = \frac{(R_{AY} - R_{AX} R_{YX})^2}{1 - R^2_{AX}}$ = "usefulness."


This leads to the useful interpretation that ANCOVA and its
dummy variable regression form are equivalent to assigning each
person in a group the value of the “effect” (i.e., $A_j$) for that
group, this effect being scaled in units of the dependent variable.
One may then think in terms of a single “treatment” variable
which is one of the independent variables in the regression
equation with the covariates as the remaining independent variables,
though considerations about degrees of freedom are a little more
complicated this way.
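
The interpretation above can be illustrated numerically. The sketch below uses simulated data and hypothetical variable names (none of it comes from the original study): it recovers the intercepts from the dummy variable weights, assigns each person the intercept of his group, and confirms that the squared part correlation computed from three simple correlations reproduces the "usefulness" of the dummy variables.

```python
import numpy as np

rng = np.random.default_rng(1)
group = np.repeat(np.arange(3), 25)               # three hypothetical groups of 25
x = rng.normal(size=group.size) + 0.6 * group     # simulated covariate
y = np.array([0.0, 1.0, 2.5])[group] + 0.7 * x + rng.normal(size=group.size)

def r_squared(design, target):
    """Squared multiple correlation of target with the columns of design."""
    coef = np.linalg.lstsq(design, target, rcond=None)[0]
    resid = target - design @ coef
    dev = target - target.mean()
    return 1.0 - (resid @ resid) / (dev @ dev)

# Dummy variable form: Z1 = 1 for everybody, Z2 and Z3 flag groups 2 and 3.
design = np.column_stack([np.ones_like(x), group == 1, group == 2, x]).astype(float)
b1, b2, b3, bw = np.linalg.lstsq(design, y, rcond=None)[0]

# Intercepts from the regression weights, then one "treatment" score per person.
intercepts = np.array([b1, b1 + b2, b1 + b3])
a = intercepts[group]

r = np.corrcoef(np.vstack([a, x, y]))
r_ax, r_ay, r_yx = r[0, 1], r[0, 2], r[1, 2]

# Squared part correlation of Y with A, covariate partialed out of A.
part_sq = (r_ay - r_ax * r_yx) ** 2 / (1.0 - r_ax ** 2)

# "Usefulness" as the increment in squared multiple correlation.
usefulness = r_squared(design, y) - r_squared(design[:, [0, 3]], y)
```

The two quantities agree because the single variable A carries all the information in the dummy variables that the least squares fit actually uses.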


² The notation for the multiple partial beta is our own; we have found
no references to such a coefficient other than Werts (1968). Paralleling ...'s
(1957) notation, the first letter of the subscript refers to the dependent
variable, the letters in the parentheses refer to the relevant independent variables,
i.e., the dummy variables $z_2$ and $z_3$, and the letter ... controlled variable.


Application of Alternate Procedures


The essence of the above argument is that a categorical variable
(treatment groups) may for analytical purposes be replaced by a
single variable representing the treatment "effect" on the dependent
variable. According to ANCOVA the proportion of variance in the
dependent variable that this treatment variable "accounts for" is
the squared part correlation $R^2_{(A \cdot X)Y}$; however, several other
procedures exist for calculating the proportion of “unique” variance in
the dependent variable that a variable “accounts for.” In addition
to $R^2_{(A \cdot X)Y}$, the following procedures could be and have been
employed in regression analysis:


a. The multiple partial correlation $R_{y(z_2 z_3) \cdot x}$ of the dummy
variables with the dependent variable:

$R^2_{y(z_2 z_3) \cdot x} = R^2_{YA \cdot X} = \frac{(R_{AY} - R_{AX} R_{YX})^2}{(1 - R^2_{AX})(1 - R^2_{YX})}$

b. The multiple part correlation $R_{(y \cdot x)(z_2 z_3)}$ of the dummy variables
with the dependent variable, covariate(s) partialed out of the
dependent variable:

$R^2_{(y \cdot x)(z_2 z_3)} = R^2_{A(Y \cdot X)} = \frac{(R_{AY} - R_{AX} R_{YX})^2}{1 - R^2_{YX}}$

c. The multiple standardized partial regression coefficient $b^*_{y(z_2 z_3) \cdot x}$
of the dummy variables with the dependent variable when the
covariates are also independent variables:

$(b^*_{y(z_2 z_3) \cdot x})^2 = (b^*_{YA \cdot X})^2 = (b_2^*)^2 + (b_3^*)^2 + 2 b_2^* b_3^* r_{23} = \frac{(R_{AY} - R_{AX} R_{YX})^2}{(1 - R^2_{AX})^2}$

where $r_{23}$ is the correlation of $Z_2$ and $Z_3$.

Thus four techniques for obtaining the proportion of variance that
the categorical treatment variable “accounts for” are possible in a
regression analysis:


Method 1. The squared partial correlation $R^2_{YA \cdot X}$
Method 2. The squared part correlation $R^2_{(A \cdot X)Y}$
Method 3. The squared part correlation $R^2_{A(Y \cdot X)}$
Method 4. The squared beta weight $(b^*_{YA \cdot X})^2$.
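
Since the four indices share one numerator and differ only in the denominator, they are easy to compare side by side. The following sketch is illustrative only; the function name, the dictionary keys, and the three example correlations are hypothetical and are not taken from any data set.

```python
def variance_indices(r_ay, r_ax, r_yx):
    """Return the four "proportion of variance" indices for a treatment
    variable A, covariate X, and dependent variable Y, given their
    simple correlations."""
    numerator = (r_ay - r_ax * r_yx) ** 2
    return {
        "method_1_partial_sq": numerator / ((1 - r_ax ** 2) * (1 - r_yx ** 2)),
        "method_2_part_sq_x_out_of_a": numerator / (1 - r_ax ** 2),
        "method_3_part_sq_x_out_of_y": numerator / (1 - r_yx ** 2),
        "method_4_beta_sq": numerator / (1 - r_ax ** 2) ** 2,
    }

# Hypothetical correlations chosen only to show that the indices differ.
example = variance_indices(r_ay=0.5, r_ax=0.4, r_yx=0.6)

# With A and X uncorrelated, Methods 2 and 4 both reduce to r_ay squared.
independent = variance_indices(r_ay=0.5, r_ax=0.0, r_yx=0.6)
```

When the treatment variable and the covariate are uncorrelated, Methods 2 and 4 coincide, which is one way of seeing why the choice among methods matters only under nonindependence.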


Insofar as the researcher wishes only to test the null hypothesis
of no improved predictability effect, it does not matter which
approach is used since the same F ratio and degrees of freedom for
testing statistical significance are applicable to all four methods. The
F test of the unstandardized regression weight $B_{YA \cdot X}$ [which equals unity
as seen in equation (1)] also yields the identical level of
significance. If the researcher wishes to interpret the magnitude of the
“proportion of variance accounted for” as indicating a large or small
treatment effect, then the theoretical considerations detailed by
Linn and Werts (1969) become relevant; for example, Method 3
($R^2_{A(Y \cdot X)}$) requires the assumption that the “treatment effect" and
the covariate be independent. Method 4 ($(b^*_{YA \cdot X})^2$) is the only one
of the four methods which allows for X and A to be
nonindependent determinants of Y. Even for Method 4, however, a strong
model must be adopted to enable the regression weight to be
interpreted as a "treatment effect.”


Implications for Understanding and Extending ANCOVA 


The foregoing discussion indicates that given the assumption of
within-group homogeneity of regression the ANCOVA model is a
special case of the general linear regression model. To understand
the traditional ANCOVA terminology in terms of this regression
approach, consider Figure 1, which is a diagrammatic
representation of the model $Y_{ij} = A_j + B_w X_{ij} + e_{ij}$.

In this model the covariate and the treatment variable are
determinants of the dependent variable Y as indicated by the
respective arrows. The double headed arrow between $A_j$ and $X_{ij}$ indicates
a correlation and the isolated arrow from $e_{ij}$ to Y represents the
assumption that the errors are independent of $A_j$ and
$X_{ij}$ or, in causal language, that all other (implicit) unmeasured
factors influencing Y are assumed to be independent of both the
covariate and the treatment variable.


Figure 1. Representation of ANCOVA Model.


The ANCOVA assumption that treatments and covariate be
independent amounts to the assertion that any correlation between
$A_j$ and $X_{ij}$ will arise solely from sampling errors, which is equivalent
to asserting that the between groups or external slope $B_{\bar{Y}\bar{X}}$ will
differ from the within groups or internal slope $B_w$ only because of
sampling errors. Because of this assumption ANCOVA is seldom
applicable to naturalistic studies where treatment group and
covariate are usually associated because of systematic, nonrandom
influences. However, techniques have been worked out within the
framework of regression analysis for dealing with independent
variables which are systematically related; discussions of these techniques
can be found in a number of sources such as Turner and Stevens
(1959), Tukey (1954), Blalock (1961), and Duncan (1966).


It is preferable to adopt a regression approach rather than
pretend to adopt ANCOVA when its assumptions are violated. Doing so
also facilitates thinking in terms of a model of causality that
explicitly allows for an association between the covariate and the
treatment. Suppose that the covariate influenced which treatment
group a subject was assigned to; e.g., it seems likely that a student’s
family background (school input or covariate) influences not
only which school (treatment group) he attends but how much
he learns (i.e., the output) while attending that school. In that event,
given the necessary independence assumptions, the equations might
reduce to the “recursive” set of structural equations:


$Y_{ij} = A_j + B_w X_{ij} + e_{ij}$
$A_j = B_{AX} X_{ij} + u_{ij}$.
The meaning of this set is that treatments and covariate influence 
the dependent variable and the covariate also influences treat- 
ments. Blalock (1961) gives a most comprehensive analysis of re- 
cursive sets and their interpretation in linear systems. 
Given the regression form of the ANCOVA model, it is then
possible to understand more precisely the consequences of violating the
independence assumption (Evans and Anastasio, 1968). In this
regard, Winer (1962) states: “When the covariate is actually affected
by the treatment, the adjustment process removes more than an
error component from the criterion, it also removes part of the
treatment effect.” In more detail, it follows from the principles of
regression analysis that when there are two independent variables,
i.e., $A_j$ and $X_{ij}$, then removing from the predictable variance all
variance associated with one of them, i.e., $X_{ij}$, will necessarily
remove that part of the variance ascribable to the other variable, i.e.,
$A_j$, which is correlated with the first variable. Thus if $A_j$ and $X_{ij}$
are correlated $r_{AX}$ because of nonindependence, the variance
predictable from $X_{ij}$ will include $r^2_{AX} \sigma^2_A$ of the variance in $A_j$. The
adjusted treatment variance in ANCOVA will therefore be too small
(i.e., biased) by the factor $(1 - r^2_{AX})$ when treatment effect and
covariate are nonindependent. In ANCOVA language, the adjusted
treatment sums of squares (i.e., $T_{yyR}$) will equal the sums of
squares (i.e., $T_{yyA}$) of the adjusted treatment means times the
factor $(1 - r^2_{AX})$. The usual formula relating the between-groups
regression slope to the within-groups slope [e.g., equation (1) or
(2) on p. 585 of Winer, 1962] can be converted to show the same
relationship. In the presence of nonindependence the usual
ANCOVA procedure is therefore biased, yielding an underestimate
of the treatment variance. This problem is a well-known regression
phenomenon, discussions of which may be found in most
econometrics texts (e.g., Malinvaud, 1966). An important advantage
of casting the analysis of covariance in a regression framework is
that the various techniques like “elaboration,” contextual analysis,
ecological correlation, latent structure analysis, and Guttman scale
analysis, which Schuessler (1968) shows can be expressed in terms
of analysis of covariance, can all be handled by regression analysis.
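
The shrinkage by the factor (1 - r²) described above can be verified on simulated data. The sketch below is an illustration under assumed values (the group effects, the slope, the seed, and all names are hypothetical): it builds a covariate that is deliberately related to group membership, fits the ANCOVA model by least squares, and compares the adjusted treatment sum of squares with the sum of squares of the intercepts multiplied by that factor.

```python
import numpy as np

rng = np.random.default_rng(2)
group = np.repeat(np.arange(3), 30)               # three hypothetical groups of 30
x = rng.normal(size=group.size) + 1.0 * group     # covariate related to group
y = np.array([0.0, 1.0, 2.0])[group] + 0.5 * x + rng.normal(size=group.size)

def residual_ss(design, target):
    """Residual sum of squares and weights of a least squares fit."""
    coef = np.linalg.lstsq(design, target, rcond=None)[0]
    resid = target - design @ coef
    return resid @ resid, coef

ones = np.ones_like(x)
dummies = np.column_stack([group == j for j in range(3)]).astype(float)

# Full model: one intercept per group plus a common within-group slope.
ss_full, coef = residual_ss(np.column_stack([dummies, x]), y)
# Reduced model: the covariate alone.
ss_cov, _ = residual_ss(np.column_stack([ones, x]), y)
adjusted_treatment_ss = ss_cov - ss_full

a = coef[:3][group]                               # intercept A_j assigned to each person
ss_a = ((a - a.mean()) ** 2).sum()                # sum of squares of the intercepts
r_ax = np.corrcoef(a, x)[0, 1]
predicted_ss = ss_a * (1.0 - r_ax ** 2)           # shrunk by the factor (1 - r_AX^2)
```

The agreement is exact rather than approximate, since the fitted values of the full model are the intercepts plus the within-group slope times the covariate.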


Further Extensions 
When the ANCOVA model is stated in regression terms the
covariate is simply another independent variable in the regression
equation. The covariate itself could be a categorical variable (e.g.,
$B_k$) expressed as a set of dummy variables in the regression
equation, in which case the model equation becomes:


$Y_{ijk} = A_j + B_k + e_{ijk}$


Johnston (1960) gives an excellent account of how to set up
the dummy variable analysis for this case and others. Note that
this model equation is that of a two-way analysis of variance
which, however, requires the assumption of independence of the
two categorical variables $A_j$ and $B_k$. The dummy variable analysis
calculates least squares estimates of $A_j$ and $B_k$ even if $A_j$ and $B_k$ are
nonindependent. Given $A_j$ and $B_k$, the problem resolves itself into
the familiar regression problem of two nonindependent continuous
variables which can be analyzed with any appropriate regression
technique mentioned earlier. That it is possible to analyze data
cross-categorized in two or more ways does not, of course, mean
that it is meaningful to do so.


Application to the Analysis of Compositional Effects 


... in a “compositional” effect on some outcome variable, an effect
which is not predictable from the individual characteristics alone.
If the group mean $\bar{X}_j$ were the relevant indicator of composition,
$X_{ij}$ the individual characteristic, and $Y_{ij}$ the outcome, then in the
regression model $Y_{ij} = B_1 + B_2 X_{ij} + B_3 \bar{X}_j + e_{ij}$ the partial
regression weight for $\bar{X}_j$ ($B_3$) represents the net compositional
influence with individual characteristics held constant and the weight
for $X_{ij}$ ($B_2$) represents the net individual influence with composition
controlled. A least squares solution yields $B_2 = B_w$ (i.e., the pooled
within groups slope) and $B_3 = B_{\bar{Y}\bar{X}} - B_w$ (i.e., the between minus
the within groups slopes).
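
The least squares result just stated can be confirmed on simulated data. This is a sketch under assumed values, not a reanalysis of any study; the number of groups, the group effects, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
group = np.repeat(np.arange(4), 15)               # four hypothetical groups of 15
x = rng.normal(size=group.size) + np.array([0.0, 0.5, 1.5, 2.0])[group]
y = (1.0 + 0.6 * x + np.array([0.2, -0.1, 0.8, 1.1])[group]
     + rng.normal(size=group.size))

# Group means of X and Y, repeated for every member of the group.
x_bar = np.array([x[group == j].mean() for j in range(4)])[group]
y_bar = np.array([y[group == j].mean() for j in range(4)])[group]

# Compositional model: Y on a constant, the individual X, and the group mean of X.
design = np.column_stack([np.ones_like(x), x, x_bar])
b1, b2, b3 = np.linalg.lstsq(design, y, rcond=None)[0]

# Pooled within groups slope.
xw, yw = x - x_bar, y - y_bar
b_within = (xw @ yw) / (xw @ xw)

# Between groups slope of the Y means on the X means.
xb = x_bar - x.mean()
b_between = (xb @ (y_bar - y.mean())) / (xb @ xb)
```

The weight for the individual characteristic equals the within groups slope, and the weight for the group mean equals the between slope minus the within slope, because within-group deviations of X are orthogonal to the group means.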

If treatments are independent of the covariate in ANCOVA the
external slope ($B_{\bar{Y}\bar{X}}$) equals the internal slope ($B_w$), which would
mean that $B_3 = 0$ and that there would be no “compositional”
effect. It can be shown that $B_3$ is the regression weight for $A_j$ on $\bar{X}_j$
as follows:

1. $A_j = \bar{Y}_j - B_w \bar{X}_j$.

2. Which has the normal equation

$\sigma_{A\bar{X}} = \sigma_{\bar{Y}\bar{X}} - B_w \sigma^2_{\bar{X}}$.

3. Dividing by $\sigma^2_{\bar{X}}$ yields

$\frac{\sigma_{A\bar{X}}}{\sigma^2_{\bar{X}}} = \frac{\sigma_{\bar{Y}\bar{X}}}{\sigma^2_{\bar{X}}} - B_w$

or $B_{A\bar{X}} = B_{\bar{Y}\bar{X}} - B_w$.

4. Therefore $B_3 = B_{A\bar{X}} = B_{\bar{Y}\bar{X}} - B_w$.


The relationship of the ANCOVA model to the “compositional”
effect is clarified further by noting that in the equation


$Y_{ij} = B_1 + B_2 X_{ij} + B_3 \bar{X}_j + B_4 A_j + e_{ij}$


$B_2 = B_w$, $B_3$ = zero, and $B_4$ = unity, which means that this
equation reduces to the ANCOVA model. The implication is that
the “compositional” effect is in fact part of the “treatment” effect
in ANCOVA.

To summarize: (a) the analysis of “compositional” effects
corresponds to the ANCOVA model in which treatments are not
independent of the covariate, which means that the assumptions of the
ANCOVA model, especially within group homogeneity of regression,
apply also to compositional analysis; (b) the regression slope ($B_{A\bar{X}}$)
of the intercepts (i.e., $A_j$ in the ANCOVA model) on the compositional
indicator (i.e., $\bar{X}_j$) represents the net influence of composition; and
(c) the “compositional” effect is part of the “treatment” effect in
the ANCOVA model.


Overview

It is well known that analysis of variance or analysis of
covariance are simply specialized cases of the general linear
equation, cases which are appropriate almost exclusively for
experimental studies in which such assumptions as independence of
covariate and treatments are plausible. Naturalistic and
quasi-experimental studies, however, require analytical models which
allow for the covariate to influence or be influenced by treatments
or for nonindependent sets of categorical and/or continuous
variables. Given modern computing facilities all analyses can be
efficiently set up in a general linear format even if the requirements
for ANOVA or ANCOVA are given. However, in the general
linear model one is then faced with choosing among a variety of
analytical procedures such as multiple partial correlations, multiple
part correlations, and multiple partial regression weights
(standardized and/or unstandardized). Which procedure to use and
how to interpret the results provides the real test of a good
researcher since good choices depend on an understanding of the
phenomena being studied and of the relationship of the phenomena
to the mathematical model underlying the statistics. When the
assumptions of an ANCOVA model are violated, causal inference
must rest upon the assumption that you are dealing with a closed
system or that all other variables that might influence the
dependent variable are unrelated to the treatment or the covariate.


HOW TO WRITE TRUE-FALSE TEST ITEMS


ROBERT L. EBEL 
Michigan State University 


Why Use True-False Items? 

The basic reason for using true-false test items is that they
provide a simple and direct means of measuring the essential outcome
of formal education, which is command of useful verbal
knowledge. For all knowledge can be expressed in a series of propositions,
and a proposition is simply а sentence that can be said to be true 
or false. Propositions are the substance of knowledge. Judging 
their truth or falsity is the essential task of scholarship in any 
field. 

Some test constructors obtain scores of respectable reliability 
from true-false classroom tests. Examples are given in Table 1. These 
reliabilities indicate that the items in these tests were not seriously 
ambiguous, and that guessing could not have been extensive. If an 
ambiguous true-false item is written, it is the fault of the writer, 
not of the form. Also, when guessing affects test scores seriously, 
it is because the test is too short, the items too difficult or too 
ambiguous or the examinees too poorly motivated. 

TABLE 1
Reliability of True-False Tests

Date       Students   Items   Reliability
3-6-68     114        110     .85
4-29-68    100        111     (illegible)
5-27-68    100        107     .83
7-8-68     142         99     .86
7-23-68    141         90     .78


Compared with other item forms, true-false test items are
relatively easy to write. They are simple declarative sentences of the
kind that make up most oral and written communications. It is
true that they must reflect careful thought and precise expression,
since they will be interpreted critically, and since they stand and
must be judged in isolation. Thus they must be self-contained in
meaning, depending on content not on context. But the problem
of true-false item writing is no different from the problem of
writing for any other purpose of communication. Those who have
difficulty in writing good true-false test items probably have
trouble expressing themselves clearly and accurately in other
situations also.


Two Requirements for Writing Good True-False Test Items 


The first requirement is mastery of the subject to be tested, 
which in the case of teachers always implies mastery of the 
language in which knowledge of the subject is expressed. In most 
subjects this mastery grows slowly over many years. It can seldom 
be acquired in a single course, and never in a course on test 
construction. But to the degree that it is lacking the tests produced 
are likely to be deficient, and no amount of special training in 
test construction can make up for the deficiency. Many of the 
shortcomings of teachers' tests, including their true-false tests, are
due to their inadequate command of the knowledge they are 
trying to test. 

But other shortcomings are due to lack of knowledge of special 
techniques of item writing and test construction. There are 
tricks-of-the-trade, knowledge of which will help any teacher to 
make better tests. To point out some of these that are useful in 
writing true-false test items is the principal mission of this article. 


The Process of Writing True-False Test Items 


A good true-false test item is one that is both acceptable and ef- 
fective. It will be acceptable as a measure of achievement, to one 
who believes the central purpose of formal education is to gain 
command of useful verbal knowledge, if it is significant and 
supportable. To be significant it must test the student’s command 
of some important element of useful knowledge that is not com- 
mon knowledge. To be supportable it must be unquestionably 
true or false in the opinion of a knowledgeable expert. It will be
effective if those who lack an adequate grasp of that element of
knowledge find a wrong answer attractive.

There are thus four essentials to writing a good true-false test 
item. 


1. Choose a significant idea. 

2. Devise a problem which will require understanding or appli- 
cation of the idea. 

3. Word the statement of the problem so that those who lack 
understanding of the element being tested will be attracted 
to a wrong answer.

4. Review the statement critically to make sure that any who do 
understand the point being tested ought to answer it cor- 
rectly. 


The Problem of Significance 


A test item is significant if it deals with an element of knowledge 
that is part of a structure of related concepts, ideas or events, 
and that is likely to be useful on future occasions. Significant 
items in a test are those that deal with the ends rather than the 
means of instruction. When a messenger knocks, it is the message
he bears that is usually significant, not whether he knocked three
times or two. About every item in a good test it should be possible
to give an affirmative answer to the question, “Does this item test 
an element of knowledge that is really worth knowing?” 

It is good for a test constructor to make each item he writes as 
significant as he can. But if he is writing objective test items he 
should not expect each of them to appear to have tremendous 
individual significance. There are, after all, a great many of them 
in the test. Each one involves directly only one element in a vast 
and complex structure of knowledge. The substantial significance 
of a test score rests on the summation of many lesser significances. 

There is a second reason why the items of even the best true- 
false tests may not appear to be highly significant individually. It 
is because of the typical inverse relation between an item’s
apparent significance, and the definiteness of its truth or falsity.
The requirement that each true-false test item be unequivocally 
true or false rules out the use of broad generalizations that might 
appear to be highly significant, but whose truth has not been and 
perhaps cannot be definitely established. Thus the loss to the test
constructor is only apparent, not real. The generalization may
deal with a very important problem, but if its truth is indetermin- 
able it can not give a significant indication of achievement. Re- 
gardless of what form of test item is used, essay or objective, 
one can not obtain definite assessments of competence by asking 
questions that have indefinite answers. 

Faced with the difficult problem of finding true propositions of 
high apparent significance to use as the basis for test items, it is 
reassuring to recall that, as Howard's (1943) study
showed, there is a substantial correlation between the indications
of achievement given by items of less, and those of more,
apparent significance. This means that the degree of apparent
significance probably is not a crucial factor in the validity of the
measures of achievement obtained. The difference between good 
and poor command of knowledge in an area of study shows up 
about as clearly on the less significant items as it does on the more 
significant ones. 

It remains true, however, that the acceptability of a test de- 
pends substantially on the apparent significance of the items com- 
posing it. The test constructor is well advised to try to use as 
many highly significant items in his test as he can succeed in 
developing. 


Testing Command of Knowledge 


To test a person's command of an idea or element of knowledge
is to test his understanding of it. A student who can recognize the
words in which an idea has been expressed but who cannot
recognize the same idea when it is expressed in different words does not
have command of it. Or, if he knows the idea only as an isolated
fact, without seeing how it is related to other ideas, he has no
command of it. Knowledge one has command of is not a
miscellaneous collection of separate elements. It is an integrated structure.
Knowledge one has command of is knowledge one can use to
make decisions, draw logical inferences, or solve problems. It is
usable knowledge.

Consider how one might test a student's command of
Archimedes' Principle. It should not be done by offering him the usual
expression of the principle as a true statement, or some slight
alteration of it as a false statement, as has been done in items 1 and
2 below.

1. A body immersed in a fluid is buoyed up by a force equal
to the weight of the fluid displaced. (T)

2. A body immersed in a fluid is buoyed up by a force equal
to half of the weight of the fluid displaced. (F)

Instead the student might be asked to recognize the principle
in some alternative statement of it, such as in items 3 and 4 below.

3. If an object having a certain volume is surrounded by a liquid
or gas, the upward force on it equals the weight of that volume
of the liquid or gas. (T)

4. The upward force on an object surrounded by a liquid or
gas is equal to the surface area of the object multiplied by
the pressure of the liquid or gas surrounding it. (F)

Or the student might be required to apply the principle in
... and 6 below.

5. ... is exactly the same as that on a one centimeter cube of iron
when both are immersed in water. (T)

6. If an insoluble object is immersed successively in several
fluids of different density, the buoyant force upon it in each
case will vary inversely with the density of the fluids. (F)

Sometimes the use of an unconventional example can serve to
test understanding of a concept.

7. Distilled water is soft water. (T)

It is a popular misconception that true-false test items are
limited to testing for simple factual recall. On the contrary,
complex and difficult problems can be presented quite effectively in
this form.

8. The next term in the series 3, 4, 7, 11, 18, is 29. (T)

9. If the sides of a quadrilateral having two adjacent right
angles are consecutive whole numbers, and if the shortest
side is one of the two parallel sides, then the area of the
trapezoid is 18 square units. (T)

True-false test items can also be used ...
are likely to have specific rather than gen...
are the elements of historical and literary knowledge not
isolated details, and ought not to be learned as such. ...
of a structure of knowledge, less universal than scientific knowledge
perhaps, but none the less a structure. A student who has
command of an element of literary or historical knowledge will
understand the ideas involved in it, will be able to draw inferences
from them, and will know their relations to other ideas.

Consider, as an illustration, the episode in American history 
known as the Battle of Trenton, described in the passage below. 
A student's command of this segment of historical knowledge 
can be tested by using true-false items such as those numbered 10 
to 15 below. 


The Battle of Trenton¹


The attack on the Hessian troops stationed at Trenton was made
at dawn on December 26, 1776. During the previous night George
Washington and 2,500 of his troops had crossed the Delaware
River from Pennsylvania through floating ice. They landed in
New Jersey about nine miles above Trenton. Approaching the
town by two roads, the American army surprised the Hessian
outposts and then rushed upon the main body before it could form
effectively. The charge of the American troops and their fire of
the artillery and muskets completely disconcerted the enemy. A
few hundred escaped but the majority (over 900) were surrounded
and forced to surrender.

10. Before the Battle of Trenton the American Army crossed
the Delaware River from west to east. (T) 

11. The Battle of Trenton took place during the second year 
of the Revolutionary War. (T) 

12. The American troops outnumbered the British troops in the 
Battle of Trenton. (T) 

13. Surprise was a major factor in the outcome of the Battle of 
Trenton. (T) 

14. The American army suffered few casualties in the Battle of 
Trenton. (T) 

15. The Battle of Trenton was George Washington’s first major 
success in the Revolutionary War. (T) 

Note that none of these items deals simply and directly with any
of the specific statements in the description of the battle. All of
them require inferences from these facts, or understanding of the
relation between this event and other aspects of the war. By so
doing, they seek to test command of the knowledge.


¹ Adapted from the Encyclopedia Britannica, Vol. 22, page 218, 1968.


The Problem of Definiteness. 


A good true-false test item is definitely true or false in the eyes 
of qualified experts. This means that the point involved must be 
well-established truth. It means that the point must be expressed 
clearly. It means that concise statements of the point should be 
sought, since conciseness usually contributes more to clarity than 
does complexity.

Three things can be done beyond those already implied to help 
solve the problem of definiteness in writing true-false test items. 
One is to write, or at least to think of the items in pairs, one true 
and one false such as items 16 to 19 below. 

16. An eclipse of the moon can only occur when the moon is
full. (T)
17. An eclipse of the moon can only occur when the moon is
new. (F)
18. The average farm in the United States was larger in 1960
than it was in 1910. (T)
19. The average farm in the United States was smaller in 1960
than it was in 1910. (F)

This procedure helps to clarify what the item is testing and to
indicate whether or not it is worth testing. It encourages
conciseness of expression. It helps avoid items like 20 and 21 below which
have no plausible alternatives and which therefore would make
poor true-false test items.

20. Insurance agencies may be either specialized or general. (T)
21. Camping has a good past, a better present and an almost
unlimited future. (T)

Of course only one member of any pair is used in the same test.
The other is sometimes sufficiently different to be usable in a
second test, or in a different form of the test.

A second thing that сап be done to help solve the problem of 
definiteness is to write statements which call for comparison be- 
tween two specified alternatives as in items 22 and 23 below. 

22. The time from moon rise to moon set is generally longer
than the time from sunrise to sunset. (T)
23. The beneficial effect of a guessing correction, if any, is more
psychological than statistical. (T)

Such internal comparison focuses attention clearly on the essential 
question in the item. Of even greater help is the fact that it avoids 
the necessity of using arbitrary standards in judging truth or falsity 
and the resulting possibility that the examiner’s standards might 
differ significantly from those of the examinee. 

The third thing which helps to avoid indefiniteness in true-false
test items is careful review of the items after they have been
written. This can be of value even if done by the author of the
items himself, after several days have passed, and after the context
in which the items were written has been forgotten. It can be of
even greater value if done independently by a competent colleague.
Independent review is not likely to supply quality to true-false 
test items that are grossly lacking in it, but it can help to avoid 
errors and ambiguities in communication that sometimes result 
from singularity in point of view or mode of expression. 


Making Wrong Answers Attractive. 


The job of a test item is to discriminate between those who 
have and those who lack command of some element of knowledge. 
Those who have the command should be able to answer the 
question correctly without difficulty. Those who lack it should 
find the wrong answers attractive. To make them so is one of the
arts of item writing. Here are some of the ways in which it can be
done.

A. Use more false than true statements in the test.

When in doubt, students seem more inclined to accept than to 
challenge propositions presented in a true-false test. The experi- 
mental evidence for this inclination is impressive. Of course if they 
come to expect more false items than true, some of the value of 
this technique is lost. But the imbalance is not easy to discover, 
and cannot be counted on confidently in the future even if dis- 
covered. So it continues to work quite well even in classes which 
are aware of it as a possibility.

B. Word the item so that superficial logic suggests a wrong
answer.

24. A rubber ball weighing 100 grams is floating on the surface
of a pool of water exactly half submerged. An additional
downward force of 50 grams would be required to sub-
merge it completely. (F)

The ball is half submerged and weighs 100 grams, which gives
one half of 100 considerable plausibility on a superficial basis.
The true case is, of course, that if its weight of 100 grams sub-
merges only half of it, another 100 grams would be required to
submerge all of it. Superficial logic also would make the incorrect
answers to these questions seem plausible.

25. Since students show a wide range of individual differences, 
the ideal measurement situation would be achieved if each 
student could take a different test specially designed to test 
him. (F) 

26. The output voltage of a transformer is determined in part
by the number of turns on the input coil. (T)

27. A transformer that will increase the voltage of an alternating
current can also be used to increase the voltage of a direct
current. (F)

C. Make the wrong answer consistent with a popular miscon- 
ception or a popular belief irrelevant to the question. 

28. The effectiveness of tests as tools for measuring achieve- 
ment is lowered by the apprehension students feel for them. 
(F) 

Many students do experience test anxiety, but for most of them
it facilitates rather than impedes maximum performance.

29. An achievement test should include enough items to keep 
every student busy during the entire test period. (F) 

Keeping students busy at worthy educational tasks is usually 
commendable, but in this case it would make rate of work count 
too heavily, in most cases, as a determinant of the test score. 

D. Use specific determiners in reverse to confound testwiseness.

In true-false test items extreme words like always or never
tend to be used mainly in false statements by unwary item writers,
whereas more moderate words like some, often, or generally tend
to be used mainly in true statements. When they are so used they
qualify as “specific determiners” which help testwise but uninformed
examinees to answer true-false questions correctly. But
some always or never statements are true and some often or
generally statements are false. Thus these specific determiners can
be used to attract the student who is merely testwise to a wrong
answer.

E. Use phrases in false statements that give them the “ring of
truth”.

30. The use of better achievement tests will, in itself, con- 
tribute little or nothing to better achievement. (F) 

The phrases “in itself” and “little or nothing” impart a tone of 
sincerity and rightness to the statement that conceals its falseness
from the uninformed. 

31. To insure comprehensive measurement of each aspect of 
achievement, different kinds of items must be specifically 
written, in due proportions, to test each different mental 
process the course is intended to develop. (F) 

There is superficial logic to this statement like those illustrated 
under B above. But it also displays the elaborate statement and 
careful qualifications that testwise individuals associate mainly 
with true statements. 

Is a teacher playing fair with his students if he sets out deliber-
ately to make it easy for some of them to give wrong answers to 
his test items? If he wants to measure achievement validly, that is 
to distinguish correctly between those who have and those who 
lack command of a particular element of knowledge, it is the only 
way he can play fair. The only reason a test constructor sets out 
to make wrong answers attractive to those who lack command of 
the knowledge is so that correct answers will truly indicate the 
achievement they are supposed to indicate. 


Conclusions 

This article has attempted to set forth some reasons why true- 
false test items should be used in measuring educational achieve- 
ment, and some means by which they can be used effectively. In 
the author's teaching experience during the last decade, they have 
proved satisfactory to him and to his students. They have shown 
up well in test analysis. It seems likely that some teachers who 
have been touted off true-false tests could serve themselves and 
their students well by taking another close look at them. 
A NOTE ON GAYLORD'S "ESTIMATING TEST
RELIABILITY FROM THE ITEM-TEST 
CORRELATIONS" 


JOHN BOWERS 
University of Illinois 


Gaylord (1969) demonstrated the algebraic inconsistency of
Guilford’s (1956) reliability formula,

$$r_{tt} = \frac{n\,\bar{r}_{it}^{2}}{1 + (n - 1)\,\bar{r}_{it}^{2}} \tag{1}$$

based upon Richardson's (1936) relationship,

$$\bar{r}_{ij} = \bar{r}_{it}^{2} \tag{2}$$

where $\bar{r}_{ij}$ = the average item intercorrelation and $\bar{r}_{it}^{2}$ = the square
of the average item-test correlation.

It may be instructive to examine this inconsistency. If item
variances are assumed equal, the sum of all elements in the item
variance-covariance matrix is,

$$\sum_{i}\sum_{j} r_{ij}\,\sigma_{i}^{2} = \sigma_{t}^{2} \tag{3}$$

where $i = 1, \ldots, n$ rows,
$j = 1, \ldots, n$ columns,
$r_{ij} = 1$ when $i = j$,
$\sigma_{t}^{2}$ = total score variance,
$\sigma_{i}^{2}$ = the common item variance.

The sum of all elements in the i-th row is,

$$\sum_{j} r_{ij}\,\sigma_{i}^{2} = r_{it}\,\sigma_{i}\,\sigma_{t} \tag{4}$$

and from (3) and (4), when $\sigma_{i}$ and $\sigma_{t}$ are cancelled,

$$\sum_{i} r_{it} = \sqrt{\sum_{i}\sum_{j} r_{ij}} \tag{5}$$

When (5) is squared and averaged over n items,

$$\bar{r}_{it}^{2} = \bar{r}_{ij} \tag{6}$$

This is Richardson's (1936) identity, and is the average of the
elements in the item intercorrelation matrix including ones in the
diagonal.

If $\bar{r}_{ij}'$ is defined as the average of the off-diagonal elements in the
item intercorrelation matrix, then as Gaylord has shown,

$$\bar{r}_{ij}' = \frac{n\,\bar{r}_{it}^{2} - 1}{n - 1} = \bar{r}_{it}^{2} - \frac{1 - \bar{r}_{it}^{2}}{n - 1} \tag{7}$$

When the reliability of each item in a test is estimated by its
average correlation with the remaining n − 1 items in a test (Kuder
and Richardson, 1937), the Kuder-Richardson reliability estimate is

$$r_{tt} = \frac{\sigma_{t}^{2} - n\,\sigma_{i}^{2} + n\,\sigma_{i}^{2}\,\bar{r}_{ij}'}{\sigma_{t}^{2}} \tag{8}$$

which can be reduced to,

$$r_{tt} = \frac{n\,\bar{r}_{ij}'}{1 + (n - 1)\,\bar{r}_{ij}'} \tag{9}$$

Guilford's (1956) formula is erroneous, since $\bar{r}_{it}^{2} = \bar{r}_{ij}$ rather
than $\bar{r}_{ij}'$. But, as (7) indicates, the error diminishes as n becomes
large.


Gaylord's interesting expression (7) is, in this notation,

$$r_{tt} = \frac{\bar{r}_{ij}'}{\bar{r}_{ij}} \tag{10}$$

Thus, KR-20, under the assumption of equal item variances, is 
the ratio of two averages calculated from the item intercorrelation 
matrix. When item means are also equivalent, expression (10) is the 
KR-21 reliability estimate. 
The true score variance, $\sigma_{\infty}^{2}$, is defined as the product of the
observed variance and the reliability. From (3),

$$\sigma_{t}^{2} = n^{2}\,\sigma_{i}^{2}\,\bar{r}_{ij} \tag{11}$$

so, multiplying (10) by (11),

$$\sigma_{\infty}^{2} = n^{2}\,\sigma_{i}^{2}\,\bar{r}_{ij}' \tag{12}$$

The definition of reliability as the ratio of true score to observed
score variance holds since (12) divided by (11) equals (10). The error
variance is the difference between (11) and (12),

$$\sigma_{e}^{2} = n^{2}\,\sigma_{i}^{2}\,(\bar{r}_{ij} - \bar{r}_{ij}') \tag{13}$$

or

$$\sigma_{e}^{2} = n^{2}\,\sigma_{i}^{2}\left(\frac{1 - \bar{r}_{ij}'}{n}\right) \tag{14}$$

which shows, as is well known, that Kuder-Richardson reliabilities
depend altogether on item intercorrelations.
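
The identities in this note can be checked numerically. The following is a minimal sketch, not part of the original note: the 4 x 4 intercorrelation matrix is invented for illustration, item variances are taken as equal, and each item-test correlation is computed as the item's row sum divided by the square root of the grand sum, which is what the row-sum and grand-sum relations imply under that assumption.

```python
from math import sqrt

def summarize_intercorrelations(R):
    """Summarize an n x n item intercorrelation matrix with a unit diagonal.

    Equal item variances are assumed, as in the note.
    """
    n = len(R)
    total = sum(sum(row) for row in R)             # all elements, diagonal included
    r_it = [sum(row) / sqrt(total) for row in R]   # item-test correlations
    mean_r_it = sum(r_it) / n
    mean_all = total / n ** 2                      # average including the diagonal
    mean_off = (total - n) / (n * (n - 1))         # average of off-diagonal elements
    return {
        "mean_r_it_squared": mean_r_it ** 2,
        "mean_all": mean_all,
        "mean_off": mean_off,
        # Kuder-Richardson estimate in Spearman-Brown form (off-diagonal average)
        "kr_spearman_brown": n * mean_off / (1 + (n - 1) * mean_off),
        # The same estimate written as a ratio of the two averages
        "kr_ratio": mean_off / mean_all,
        # Guilford's formula, which uses the squared average item-test correlation
        "guilford": n * mean_all / (1 + (n - 1) * mean_all),
    }

# Hypothetical intercorrelations for four items (illustrative values only).
R_example = [
    [1.0, 0.3, 0.4, 0.2],
    [0.3, 1.0, 0.5, 0.3],
    [0.4, 0.5, 1.0, 0.1],
    [0.2, 0.3, 0.1, 1.0],
]

summary = summarize_intercorrelations(R_example)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For this made-up matrix the squared average item-test correlation equals the average of all elements (Richardson's identity), the two forms of the Kuder-Richardson estimate agree, and Guilford's formula gives a larger value, in line with the argument that it substitutes the average including the diagonal for the off-diagonal average.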
THE EFFECTS OF FOREWARNING AND
PRETESTING ON ATTITUDE CHANGE 


GLORIA COWAN 
Wayne State University 


S. S. KOMORITA 
Indiana University 


McGuire (1968) has commented that “. . . today's artifact may
be tomorrow’s independent variable.” An instance of the validity
of this hypothesis is illustrated in the field of persuasive communi-
cation where the subject’s suspiciousness of the experimenter's
intent to persuade has become a substantive issue in its own right.
In earlier years, the problem of suspicion and intent of the
experimenter was studied with a focus on the social psychology
of the experiment. Orne (1962) suggested that the subject for-
mulates his own hypotheses about the nature and purpose of the 
experiment, and therefore, the subject’s behavior—derived from 
the demand characteristics of the experiment—may then be a 
response to his own hypotheses. Silverman (1968), for example, 
found more acquiescence to a persuasive message when it was 
presented in the context of a psychological experiment than when 
it was not. 

In recent years, however, two opposing hypotheses, aside from
the methodological implications of the problem, have been
proposed: Hovland and Festinger's classical view that awareness and
warning evoke resistance and decrease the persuasive impact of a
communication, and McGuire’s hypothesis (1968) that suspiciousness
of intent may actually enhance the impact of the message
by increasing message reception.

The main purpose of this experiment was to explore the rela-
tionship between the subject's awareness of the intent of the
experimenter, Orne's “demand characteristics,” and attitude change. 
The extent to which the subject’s perception of how he is ex- 
pected to respond to a communication is related to his actual 
response in an attitude change context has both methodological and 
substantive implications. The substantive issue of forewarning may 
be tested by experimentally forcing the subject to indicate his 
perception of the intent of the experimenter before he indicates 
attitude change (forewarning) vs. after he indicates attitude change 
(no forewarning).

An essential mediating step between general suspiciousness of 
the intent of the experimenter and attitude change is the direction 
and degree of change the subject thinks the experimenter expects 
him to show. If all subjects’ evaluations of the intent of the experi- 
menter are the same, a unidirectional and constant effect of sus- 
piciousness on attitude change would be predicted, and forewarning 
should increase (or decrease) the impact of the communication— 
depending on what theoretical position, Hovland or McGuire, is 
taken. However, if the mean amount of attitude change of a fore- 
warned group does not differ significantly from that of a group not 
forewarned, and the particular hypotheses vary with the subject, 
forewarning should increase the correlation between the subject's 
hypothesis and his experimental behavior. 

If we look hypothetically at the possible kinds of hypotheses
available to the S, there are three possibilities. The S may
hypothesize that: (a) he is expected to change in a favorable direction,
(b) he should resist influence and should show that he is not so
easily persuaded, and (c) the intent of the E is to show the fallacy
of a one-sided communication, thus suggesting the opposite or a
boomerang effect. Rather than assuming similar demand
characteristics on the part of different subjects, the hypotheses or
suspicions may be directly measured.

A secondary purpose of the study was to determine if pretesting
sensitizes the subject to the experimental treatment by
providing demand characteristics beyond those available
in the “after-only” design. Pretesting can also be viewed as a
suspicion arouser, decreasing attitude change, or an enhancer of
attitude change by increasing the clarity of the demand
characteristics. Although there is little evidence to indicate that the
pretest sensitizes subjects to the experimental treatment (e.g., Lana,
1959), the pretest, if anything, tends to reduce the effect of the
experimental treatment (Hovland, Lumsdaine, and Sheffield,
1949). Accordingly, it is plausible that the effects of the “pre-post”
design, as compared to the “after-only” design, may interact
with the demand characteristics of the experiment. Initial
attitude toward the issue provides, at the same time, another link
in the suspiciousness-change chain. Although the subject's initial
attitude toward the issue has been investigated via the discrepancy
between the position urged in the message and S's initial position and
attitude change, it is plausible that the perception of the intent of
the message, and thus the direction of suspiciousness itself, is
related to the subject’s initial position. Again, pretesting, like
forewarning, may fail to exert a constant effect (positive or
negative) on attitude change, but instead may increase the
correlation between the subject’s own hypothesis and his experimental
behavior.


Procedure 


Seventy-five students in Introductory Psychology classes were 
given a favorable written communication about advertising. Half 
had been pretested two weeks earlier in their classes with the 
advertising issue embedded in nine other issues. The other half 
were not given a pretest. Both the pretest and posttest consisted
of five evaluative semantic differential scales. 

All subjects were asked to read and to evaluate an article on
advertising appearing in the Saturday Review by Charles Horton,
an expert on advertising. Before reading the article, the subjects
were told that they would be asked to evaluate the readability of
the article and the point of view of the author. Immediately after
reading the communication, half the subjects took the posttest and
half took the awareness measure. Those subjects who responded to
the posttest first then responded to the awareness measure, and
those who took the awareness measure first then responded to the
posttest. Subjects were run in two large groups.


1 “Advertising” was chosen as an issue because pretesting on various issues
indicated that the mean on the advertising issue was close to the scale
neutral point of 20, and the standard deviation of 8.2 allowed for variability
in initial attitude.



The instructions given orally for the awareness measure after
the posttest were:


We would like you to fill out one more questionnaire, but the 
instructions for this questionnaire are different from the ones 
you were given on the previous questionnaire. What we would 
like you to do this time is to show how you think we expected 
you to respond to the article. Every psychological experiment
starts with a hypothesis and the experimenter is making pre- 
dictions about how you will respond. What we want you to do 
is to show us what you think we are predicting. 

You can tell us your ideas about what the experiment is about 
by filling out the same sheet you just completed. This time 
answer it in the way in which you think we expected you to 
respond. By filling out this questionnaire, you are telling us 
what you think is the purpose of this experiment. Does every- 
one understand what he is asked to do? Remember, this 
time you are showing us what you think the experiment is
about, rather than how you feel about advertising. 

This may or may not be the same as your previous responses. 


The instructions for responding to the awareness measure for 
those subjects who had not yet taken the posttest were essentially 
the same, differing only in those aspects referring to the posttest. 

The awareness instrument was the identical semantic differential 
used as pre and post tests. The semantic differential awareness 
measure forces the subject to indicate quantitatively the direc-
tion and amount of change he thinks he is expected to show, and 
does not permit him to respond ambiguously about the intent of 
the experimenter. It is also possible to determine the difference 
between his hypothesis and his experimental behavior using this 
instrument. 

Giving the awareness measure before the posttest but not be- 
fore the presentation of the communication eliminates the possibil- 
ity of the subject’s practicing or elaborating his defense. 

For all groups, the correlation coefficients between awareness
and posttest measures were obtained. For the pretested groups,
the correlation coefficients between pre and post measures and
between pre and awareness measures were also obtained. The
design of the study was a 2 × 2 factorial with forewarning
(presentation of the awareness measure before vs. after the posttest
measure) and pretesting vs. no pretesting as the two variables.


Results 


Mean differences. Table 1 presents the means of the three
response measures: posttest scores, awareness scores, and change
scores as a function of forewarning and pretesting. The analyses
of variance of the data in Table 1 indicated that neither the
main effects of forewarning or pretesting nor the interaction
between the two variables was significant at the .05 level. Thus, no
significant directional effect of forewarning or pretesting was
found on the subject's responses to the communication or his
awareness of how he was supposed to respond.

Correlational analyses. The correlation coefficients between
pretest, awareness, and posttest measures are shown in Table 2.
It can be seen that the correlations differ significantly as a function
of forewarning and pretesting. A significant relationship between
awareness and posttest scores was obtained only in the
groups given the awareness measure prior to the posttest. Since
the sample sizes were quite small, it was decided to replicate the
study for the pretest groups, particularly so that pretest level
could be assessed. Table 2 also shows the data for the replication
(Sample 2). It can be seen that a similar pattern of correlations
was obtained for Sample 2, indicating that these differences are
quite reliable.

For the combined pretested groups, the .60 correlation between 
awareness and posttest for the forewarned group is significantly 


greater than the .01 correlation for the non-forewarned group 


TABLE 1
Mean Posttest, Awareness, and Change Scores as a Function of Order

             Pretested,       Pretested,        Not Pretested,   Not Pretested,
             Posttest first   Awareness first   Posttest first   Awareness first
             (n = 19)         (n = 18)          (n = 18)         (n = 20)
Posttest     23.36            25.83             [illegible]      24.80
Awareness    27.36            30.00             26.11            21.40
Change       [illegible]      [illegible]       --               --

TABLE 2
Correlations between Pretest, Posttest, and Awareness

                             Posttest-                Pretest-
                             Awareness    Pre-post    Awareness
A. Pretested groups
   1. Posttest first
      Combined (n = 57)        .01         .69**       .33**
      Sample 1 (n = 19)        .18         .70**       .41*
      Sample 2 (n = 38)        .07         .69**       .29*
   2. Awareness first
      Combined (n = 42)        .60**       .09         .08
      Sample 1 (n = 18)        .57*        .24         .07
      Sample 2 (n = 24)        .63**       .05         .09
B. Post-only groups
   1. Posttest first
      Sample 1 (n = 18)        .07         --          --
   2. Awareness first
      Sample 1 (n = 20)        .26         --          --

* p < .05.
** p < .01.
(one-tailed tests)


(z = 3.25, p < .01), but is not significantly greater than the .26
correlation in the forewarned but not pretested group (z = 1.35). 
Column 2 of Table 2 also shows that there is a high correlation 
between pre and post test for the non-forewarned group (.69), 
but very little relationship between pre and post test for the 
forewarned group (.09). The difference between the two correla- 
tions is significant at the .01 level (z = 3.73). 
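
The article does not name the test used to compare correlations from independent groups. A common choice is Fisher's r-to-z transformation, and the sketch below assumes that method; the function name and the use of the combined sample sizes from Table 2 are illustrative choices, not details stated by the authors.

```python
from math import atanh, sqrt

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Each correlation is transformed with Fisher's z (the inverse hyperbolic
    tangent), whose sampling variance is approximately 1 / (n - 3).
    """
    z1 = atanh(r1)
    z2 = atanh(r2)
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error

# Awareness-posttest correlation: forewarned (.60, n = 42)
# versus non-forewarned (.01, n = 57) pretested groups.
z_awareness = fisher_z_difference(0.60, 42, 0.01, 57)
print(round(z_awareness, 2))
```

Under these assumptions the first comparison comes out at about 3.25, matching the value reported for the .60 versus .01 contrast. The other reported z values may differ slightly from what this sketch yields, since the authors' exact procedure and rounding are not given.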

There is a significant relationship between pretest and
awareness scores, as seen in column 3 of the correlational table, for the
non-forewarned group (.33) but not for the forewarned group
(.08); however, the difference between the two correlations is not
significant (z = 1.23). Thus, although forewarning and pretesting
do not affect mean awareness or posttest scores, they do appear
to affect the relationship between these measures.

A supplementary analysis was conducted to test the interaction
between pretest level and awareness, i.e., to determine whether
those subjects both initially favorable and aware were most likely
to obtain high posttest scores. Accordingly, all subjects who had
been pretested were divided into high and low initial attitude
groups and high and low awareness groups (median splits). The
analysis of variance showed a significant main effect of initial
attitude, with subjects with higher initial pretest scores, as might be
expected, also scoring higher on the posttest (F = 13.20, p <
.01). There was also a tendency, of borderline significance (F
= 3.86, p < .06), for more aware subjects to have higher posttest
scores than less aware subjects, but the interaction between
awareness and pretest level was not significant.


Discussion

The results of this study indicate a significant relationship be- 
tween the subject’s awareness of the intent of the experimenter 
and his responses to a communication if and only if he is both 
forewarned and pretested. When the subject has been pretested 
and is forced to formulate and make explicit his suspicions prior to 
posttesting, his posttest attitude toward the issue is significantly 
related to his specific suspicions. Apparently, asking the subject to 
give his hypothesis first allows him to use this new cue, his experi- 
mental hypothesis, to direct his experimental behavior. On the other 
hand, since the mean posttest scores do not vary as a function of
experimental conditions, suspiciousness or pretesting does not seem
to produce a directional effect on attitude change, neither resistance
to nor facilitation of persuasion. Under conditions that make
suspiciousness salient, the particular suspicions or expectations of how he
should respond are related to the subject’s actual responses. Thus,
suspiciousness may be operating in an experiment, when aroused,
but may be masked if the experimental demands, although salient,
are sufficiently ambiguous to allow variability in how subjects think
they are expected to respond.

The ambiguity of previous findings on warning of persuasive
intent (McGuire, 1968) can be accounted for in these terms in
that the nature of the forewarning by the experimenter cannot be
assumed to have a direct relationship to the way the subject
interprets the forewarning. Hastorf and Piper (1951), for example,
found no resistance to suggestion produced by explicitly
reminding the subjects that they had answered a pretest and should
give similar answers after receiving some normative feedback.
It is possible, therefore, that some subjects in the Hastorf and Piper
study deduced that they were expected to resist the experimenter's
instructions not to change. Even an explicit statement of intent
to persuade may not lead to similar deductions by subjects that
the experimenter expects them to heed the forewarning.

In the present study, the pattern of relationships between
pretest, posttest, and awareness for the treatment group for whom
suspicion should be greatest (pretested and forewarned) also
suggests that forewarning greatly attenuates the relationship
between initial attitude toward the issue and attitude after the pre-
sentation of the persuasive communication. At the same time, it is 
clear that the subject’s hypotheses regarding the intent of the 
experimenter, when forewarned, are independent of his initial 
favorability toward the issue. 

In the nonforewarned group, the relationship between pre 
and posttest is strong, but there is no relationship between aware- 
ness of intent and posttest. As the subject has not been forced to 
make explicit his hypotheses before responding to the communi- 
cation, his response to the communication does not seem to be 
influenced by his specific hypotheses, although the degree of 
favorability toward the issue he thinks he is expected to show 
is moderately related to his initial favorability toward the issue. 
Thus, initial attitude toward the issue does not appear to provide 
a link in understanding the relationship between the subject's
specific suspicions and his response to the communication. 

Pretesting alone and forewarning alone are not sufficient suspi- 
cion arousers to affect either the direction of response to the 
communication or the relationship of response to the communi- 
cation with the suspicion of the subject; however, both fore- 
warning and pretesting suggest that the subject will respond to
the communication in a way consistent with how he thinks he is 
expected to respond. 

From a substantive perspective, these findings support the view 
that salience of intent to persuade operates, but selectively. The 
lack of clear support for forewarning as a facilitator vs. fore- 
warning producing resistance may be due, in part, to the possibility 
that forewarning has variable effects on the particular suspicions 
the subject subsequently formulates; consequently, the facilitat- 
ing effects of the subject’s particular suspicions may not be obvious 
if the specific suspicions are not assessed. 

From a methodological perspective, our findings do not support
Orne’s objection to the use of post-experimental inquiries as
measures of the subject’s hypotheses about the purpose of the
experiment on the basis that the hypotheses are influenced by the
subject's preceding experimental behavior. Moreover, it does not
seem feasible to assess the hypotheses before the subject responds 
to the communication because his response to the communication 
will be affected by the nature of his hypothesis. This conclusion, 
of course, is qualified by the nature of the awareness measure 
used in this experiment. 

The concern about the sensitizing effects of pretesting, per 
se, does not seem to be warranted since there is no significant 
difference between posttest scores or awareness scores of those 
subjects pretested vs. those subjects who were not pretested. Pre- 
testing, itself, does not seem to provide additional demand cues to 
the subject. The absence of a pretesting sensitization effect should 
be qualified in terms of the precautions taken in this study to 
separate the pretest phase from the posttest phase of mea- 
surement; i.e., embedding pretest with other measures, spacing 
pretest and treatment, using different experimenters for pretest 
and treatment. The major methodological implications of this 
study are that all the concern over sensitization may have been 
overemphasized and that to the extent that these findings may be 
generalized, it does not seem necessary to use Solomon's four 
group design to control for pretest sensitization nor an awareness 
control group. 
A COMPARATIVE STUDY OF FIVE METHODS OF
ASSESSING SELF-ESTEEM, DOMINANCE, AND
DOGMATISM¹


DAVID L. HAMILTON 
Yale University 


The past two decades have seen a rapid growth in the number of
personality scales available for use in research concerned with
individual differences in experimentally-studied behaviors.
Consequently, one finds several scales with the same trait name or which
appear to be measuring the same or a similar construct. In
comparing results of experiments which have employed these different
instruments, one wonders whether the same personality attribute is
indeed being assessed by the different methods. Without evidence
to this effect, generalization across studies is at least risky, at most,
unwarranted.

The present study was undertaken to examine the construct
validity of several commonly-used personality measures. Campbell
and Fiske (1959) have pointed out that correlational evidence for
the construct validity of a psychological test requires demonstration
of both convergent validity and discriminant validity. They also
point out that two tests may correlate highly because they have
method, as well as trait, variance in common. From these
considerations it follows that correlational evidence for construct validity
requires that each of at least two traits be measured by at least two
methods. The correlations may be presented in what Campbell and
Fiske called a multitrait-multimethod matrix, which may then be
examined for the convergent and discriminant validity of the
various measures.
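
As a rough illustration of how a multitrait-multimethod matrix is read, the sketch below simulates scores for two traits measured by two methods and averages the correlations within each type of cell. Everything in it is hypothetical: the trait and method labels, the simulated data, and the 0.3 weights on method and error variance are assumptions chosen for illustration and are not taken from this study.

```python
import random
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_scores(n_subjects=200, seed=0):
    """Simulate trait-by-method scores: trait level plus method bias plus error."""
    rng = random.Random(seed)
    traits = ("dominance", "dogmatism")        # hypothetical trait labels
    methods = ("inventory", "self_rating")     # hypothetical method labels
    scores = {(t, m): [] for t in traits for m in methods}
    for _ in range(n_subjects):
        trait_level = {t: rng.gauss(0, 1) for t in traits}
        method_bias = {m: rng.gauss(0, 1) for m in methods}
        for t in traits:
            for m in methods:
                value = trait_level[t] + 0.3 * method_bias[m] + 0.3 * rng.gauss(0, 1)
                scores[(t, m)].append(value)
    return scores

def summarize_mtmm(scores):
    """Average the correlations within each type of matrix cell."""
    cells = {
        "convergent": [],                # same trait, different method
        "heterotrait_monomethod": [],    # different trait, same method
        "heterotrait_heteromethod": [],  # different trait, different method
    }
    for key_a, key_b in combinations(sorted(scores), 2):
        r = pearson(scores[key_a], scores[key_b])
        same_trait = key_a[0] == key_b[0]
        same_method = key_a[1] == key_b[1]
        if same_trait:
            cells["convergent"].append(r)
        elif same_method:
            cells["heterotrait_monomethod"].append(r)
        else:
            cells["heterotrait_heteromethod"].append(r)
    return {name: mean(values) for name, values in cells.items()}

mtmm = summarize_mtmm(simulate_scores())
for name, value in mtmm.items():
    print(f"{name}: {value:.2f}")
```

In data built this way the convergent correlations should clearly exceed both kinds of heterotrait correlations, which is the pattern Campbell and Fiske's criteria look for; shared method variance shows up as heterotrait-monomethod values above the heterotrait-heteromethod values.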

A variety of methods have been advocated for the assessment of
personality attributes. Probably the most commonly-used approach
to personality scale construction has been the method of empirical 
keying, in which items gain meaning only through their demon- 
strated ability to differentiate between groups known to differ on 
some external criterion (Meehl, 1945). Although inventories
developed by this methodology have found wide use, the interscale
correlations have frequently been high enough to question the
discriminant validity of scales with quite diverse trait names. Alternatively,
some writers (e.g., Loevinger, 1957) have argued that test
construction should have its roots in a conceptual description of the
construct of interest, so that items in the measuring instrument will
reflect the properties of the attribute being assessed. Peterson (1965)
has advocated an even more direct approach in which the subject
simply rates himself on the attribute of interest. He described a
study by Wetzel (1963) in which simple self-ratings on adjustment
and introversion-extraversion correlated highly with ratings on
these traits made by parents and peers, while intertrait correlations
were small. Peterson argued that the evidence for the convergent
and discriminant validity of lengthy inventories is no more
compelling than that reported by Wetzel for the simple self-ratings,
at least for these two attributes. As opposed to these self-report
methods, the use of peer ratings has been proposed as an assessment
device which avoids many of the problems inherent in
self-descriptive techniques (Smith, 1967).

This study examined the convergent and discriminant validity of 
alternative methods of measuring self-esteem, dominance, and dog- 
matism. These attributes were selected because (a) they seem to be 
of primary interest to psychologists (as reflected in the literature) 
and (b) several methods of assessing these traits have been pro- 
posed. Five methods of measuring these attributes were compared: 
empirically-derived true-false inventory scales, conceptually-based
self-descriptive questionnaires, conceptually-based checklist measures,
simple self-ratings, and peer ratings.

Subjects


Subjects in the experiment were 70 male undergraduate students 
at the University of Illinois. All subjects belonged to one of two
fraternities, 36 subjects belonging to one fraternity and 34 to the 
other. Subjects were not paid individually, but each fraternity was 
paid for its participation. 


Procedure 


Subjects were met in the fraternity house and all subjects in a 
given fraternity were tested at the same time. A packet of test 
materials was given to each subject, the packet consisting of several 
self-rating questionnaires and a form for making peer nominations. 
These instruments are described below. 


Measures 


Method I consisted of measures of the various traits by true- 
false, empirically derived inventory scales. Specifically, items from 
six scales of the California Psychological Inventory (CPI) were
given to the subjects. Among them were the Dominance and Flexibility
scales and two scales which, on the basis of descriptions in the
CPI Manual (Gough, 1957), were considered relevant to the more
global concept of self-esteem, the Social Presence and Self-Acceptance
scales. Scores on the latter two scales were summed to provide
an index of self-esteem. 

Method II consisted of measures in which each item requires the 
subject to rate himself on an intensity continuum. Only two instru- 
ments were of interest here: the Dogmatism scale (Rokeach, 1960) 
and the Janis-Field Feelings of Inadequacy scale (Janis and Field, 
1959), which has frequently been used as a measure of self-esteem
in attitude change research. The present writer knows of no previous
investigation of this scale's validity, in spite of its wide use in
research on persuasibility. The scale was scored such that high scores
reflected high self-esteem (i.e., few “feelings of inadequacy”). It
should be noted that the Dogmatism and Feelings of Inadequacy
scales differ from the CPI measures of Method I in (at least) two
ways. First, these scales ask the subject to indicate the
degree to which the statements are descriptive of his feelings or
attitudes, instead of employing dichotomous true-false response
categories. Second, the item content of these Method II measures was
designed to reflect the various properties of the construct being
assessed, whereas item content is given less consideration in tests
developed by a criterion-group methodology.

Method III consisted of measures derived from the Leary Interpersonal
Checklist (ICL) (Leary, 1957). The two main dimensions
in the ICL are called Dominance-Submission and Love-Hate.
The Checklist was scored for the 16 categories of interpersonal
behavior represented by the item domain, and Dominance and Love
scores were determined for each subject using the formulas suggested
by LaForge (1963). The former scores provided a checklist
measure of dominance. A measure of self-esteem was then determined
which, although not a checklist measure in the narrower
sense, was derived from the Interpersonal Checklist. Subjects completed
the ICL twice, once checking those items characteristic of
themselves, and later, on a separate ICL form, checking those items
characteristic of the way they would ideally like to be. A small
discrepancy between self and ideal-self concepts has often been
considered indicative of a high degree of self-satisfaction, while a
large discrepancy suggests a weak self-concept or low self-esteem.
Dominance and Love scores were determined for each subject for
both forms, thereby determining the location of each subject's
Self and Ideal-Self concepts within the circular framework of the
ICL. The geometric distance between the Self and Ideal-Self (the
Self-Ideal Distance) was then computed (LaForge, Leary, Naboisek,
Coffey, and Freedman, 1954). This distance was then taken as
an index of the extent of discrepancy between Self and Ideal-Self,
small distances being considered indicative of high self-esteem.
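
The Self-Ideal Distance described above can be illustrated with a short sketch. It assumes that "geometric distance" means ordinary Euclidean distance on the Dominance/Love plane; the LaForge scoring formulas are not reproduced here, and the function name and the score values are hypothetical.

```python
import math


def self_ideal_distance(self_dom, self_lov, ideal_dom, ideal_lov):
    """Euclidean distance between the Self and Ideal-Self points.

    Assumption: each ICL form has already been reduced to a
    (Dominance, Love) pair; the study's own scoring formulas are
    not shown in the text and are not reproduced here.
    """
    return math.hypot(self_dom - ideal_dom, self_lov - ideal_lov)


# Hypothetical scores for one subject (illustrative values only).
distance = self_ideal_distance(2.0, -1.0, 5.0, 3.0)
print(distance)
```

Under the interpretation given in the text, a smaller distance would be read as higher self-esteem, which is why correlations involving this index have their signs reversed in the results.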

Method IV consisted of simple self-ratings. Subjects rated them- 
selves on three seven-point scales similar in format to those used by 
Wetzel (1963). The scales were labelled “Dominant-Submissive,”
“Closed-vs. Open-Minded,” and “High vs. Low Self-Esteem.” 

Method V consisted of peer nominations. The procedures employed
were similar to those described by Norman (1963). The two
poles of each of 11 dimensions (e.g., “calm-anxious”) were listed
separately, making 22 attributes on which peer nominations were
made. For each attribute, each subject was to name the four members
of his fraternity group (limited to those participating in the
study) most characterized by the trait. Peer nomination scores
were then determined as follows: the number of times a person was
nominated for the second pole of a dimension was subtracted from
the number of times he was nominated for the first pole, and a
constant of 35 was added to eliminate negative values. Two of the
dimensions (“Has High Self-Esteem-Feels Inferior” and
“Self-Confident-Lacks Confidence”) were considered relevant to self-esteem
and these scores were summed for a measure of Self-Esteem.
Similarly, scores on “Domineering-Submissive” constituted a
peer-rating measure of Dominance, and scores on “Open-minded-Dogmatic”
and “Flexible-Rigid” were summed for a measure of
Open-mindedness.
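
The peer nomination scoring rule can be sketched as follows. This is a minimal illustration of the arithmetic stated in the text (first-pole count minus second-pole count, plus a constant of 35); the function names and the nomination counts are hypothetical.

```python
OFFSET = 35  # constant added in the study to eliminate negative values


def dimension_score(first_pole_count, second_pole_count, offset=OFFSET):
    """Score on one bipolar dimension from peer nomination counts."""
    return first_pole_count - second_pole_count + offset


def summed_trait_score(pole_counts):
    """Sum dimension scores when a trait is built from several dimensions.

    pole_counts is a list of (first_pole_count, second_pole_count) pairs,
    one pair per dimension contributing to the trait.
    """
    return sum(dimension_score(first, second) for first, second in pole_counts)


# Hypothetical subject: nominated 6 times as "Has High Self-Esteem" and
# 2 times as "Feels Inferior"; 1 time as "Self-Confident" and 4 times as
# "Lacks Confidence".
self_esteem = summed_trait_score([(6, 2), (1, 4)])
print(self_esteem)
```

Because the constant is added once per dimension, a trait summed over two dimensions carries an offset of 70 rather than 35; this shifts the scores but does not affect correlations.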

The complete multitrait-multimethod matrix would consist of 
each of the three traits measured by each of the five methods. 
However, two trait-method units were not available: a rating scale 
measure of dominance and a checklist measure of dogmatism. Hence 
only 13 of the desired 15 measures were included.


Results 


Since the relationships among the measures for the two fraterni- 
ties were highly similar, the data for the two groups were combined. 
The matrix of intercorrelations for the total sample is presented in 
Table 1. The italicized diagonal values are validity coefficients. The 
solid triangles contain heterotrait-monomethod correlations, while
the dotted lines indicate heterotrait-heteromethod triangles. A
correlation of .23 or larger is significant beyond the .05 level. It should
be noted that the checklist measure of self-esteem (the distance
between self and ideal-self concepts on the ICL) is interpreted such
that the larger the distance, the lower the self-esteem. To make it
consistent with other self-esteem measures, the signs of all
correlations involving this distance index have been changed. Likewise,
signs of correlations involving the Dogmatism scale have been
reversed to make it consistent with the other measures of this trait,
which are scored in the “Open-minded” direction.


Convergent Validity 

Evidence for convergent validity may be determined by examining
the italicized diagonal values. These are correlations between
different methods of measuring the same trait. With respect
to the self-esteem measures, the CPI and Janis-Field scales and
the self-ratings are highly intercorrelated (.67, .58, .60). Moreover,
each of these measures is significantly related to peer ratings of
self-esteem, although none of these relationships are very strong.
Assessment of self-esteem by a discrepancy between self and ideal-self
checklist responses appears to be something quite different. The
self-ideal distance measure failed to correlate highly with any of the 
other indices of self-esteem, although its relationship with the CPI 
measure is significant. 

The evidence for the convergent validity of dominance measures 
is more compelling in that none of the six validity coefficients is be- 
low .39. The correlation of .78 between the CPI and ICL dom- 
inance measures is particularly encouraging, especially since each 
is substantially correlated with peer ratings of this trait. Again, the 
self-ratings show adequate evidence of convergent validity. 

Evidence for the convergent validity of measures of the dogma- 
tism dimension is limited. The CPI Flexibility and (reflected) Dog- 
matism scales correlate .42, but neither measure is substantially 
related to the other indices of open-vs. closed-mindedness. Correla- 
tions of the Flexibility scale with self and peer ratings (.34, .25) are 
significant but not large, while correlations of Dogmatism with 
these ratings are meager (.11, .13). Self-ratings and peer ratings
showed a significant but small relationship.

Discriminant Validity 

Discriminant validity may be evaluated by examining three 
aspects of the correlation matrix: (1) whether a test's validity
coefficients are higher than its correlations with other variables in
heterotrait-heteromethod triangles, (2) whether a test's validity
coefficients are higher than its correlations with other traits measured by
the same method (i.e., its correlations in the heterotrait-monomethod
triangle), and (3) the extent to which the same pattern of
correlations among traits occurs in all of the heterotrait triangles.
The evidence pertaining to the first two of these criteria is not
impressive. That is, several of the off-diagonal values are of the same
general magnitude as their corresponding diagonal elements
(convergent validity coefficients). On this basis, then, one might
conclude that evidence for discriminant validity is lacking.

Examination of the matrix in terms of the third criterion indicates,
however, that this conclusion must be partially modified. This
criterion is concerned with the consistency of interrelationships
among traits in the heterotrait triangles. If the same traits are being
assessed by the different methods, then the same pattern of
correlations among the attributes should occur, regardless of the
methods employed. In the ideal case in which only valid trait variance
were assessed, the intertrait correlations would be the same in all
heterotrait triangles and would indicate the degree of relationship
among the traits themselves across persons. The correlations in
Table 1 indicate a striking consistency of relationship among the
attributes assessed in this study: self-esteem and dominance have
substantial common variance, and each of these traits is unrelated to
the openminded-closedminded dimension. The consistency of these
relationships is evidenced in two ways. First, of the 20 instances of
a correlation between a measure of self-esteem and a measure of
dominance, 18 are significant, 11 of the coefficients being at least .40.
The only nonsignificant relationships involved the ICL self-ideal
distance, which apparently does not measure self-esteem as assessed
by the other indices of this attribute. Second, of the 36 times a
measure of open- vs. closedmindedness is correlated with a measure
of either self-esteem or dominance, only two are significant. Moreover,
both of these cases (correlations between the Janis-Field and
Dogmatism scales, and between peer ratings of dominance and
open-mindedness) occur in monomethod triangles, where method
variance is likely to artifactually inflate intertrait correlations.
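
The counts of 13 measures, 20 self-esteem/dominance correlations, and 36 correlations involving open-mindedness follow from which trait-method units were available. The sketch below reconstructs that bookkeeping from the Measures section; the dictionary labels are shorthand introduced here for illustration, not the author's notation.

```python
from itertools import product

# Available measures per trait, as described in the Measures section.
# No rating-scale (Method II) measure of dominance and no checklist
# (Method III) measure of dogmatism were available.
measures = {
    "self_esteem": ["CPI", "Janis-Field", "ICL distance", "self-rating", "peer"],
    "dominance": ["CPI", "ICL", "self-rating", "peer"],
    "open_mindedness": ["CPI Flexibility", "Dogmatism", "self-rating", "peer"],
}


def cross_trait_pairs(trait_a, trait_b):
    """All pairings of a measure of trait_a with a measure of trait_b."""
    return list(product(measures[trait_a], measures[trait_b]))


n_measures = sum(len(names) for names in measures.values())
n_esteem_dominance = len(cross_trait_pairs("self_esteem", "dominance"))
n_open_vs_other = (
    len(cross_trait_pairs("open_mindedness", "self_esteem"))
    + len(cross_trait_pairs("open_mindedness", "dominance"))
)
print(n_measures, n_esteem_dominance, n_open_vs_other)
```

These totals include both monomethod and heteromethod pairings, which matches the way the text counts the two significant open-mindedness correlations as falling within monomethod triangles.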
Discussion 

The results of this study indicate that the criteria for discriminant
validity set forth by Campbell and Fiske (1959) are not entirely
appropriate for the case of correlated attributes. In terms of
two of these criteria, evidence for the discriminant validity of the
self-esteem and dominance measures was generally lacking. Yet the
consistency of intertrait correlations in the heterotrait triangles is
impressive; indeed, this third criterion has rarely been met so
thoroughly in other studies of this type in the literature. It is important
to note that it is not simply the consistently high correlations
between self-esteem and dominance that make these data compelling;
rather, it is this finding in conjunction with the consistent lack of
relationship of either of these attributes with the measures of
open-mindedness or flexibility. Clearly, attempts to differentially
measure self-esteem and dominance have not been successful. This,
however, does not require that we regard the two as the same
construct; height and weight are highly correlated, yet are clearly
distinct constructs. The usefulness of maintaining a distinction
between two correlated constructs is based on their differential
relationships with other variables (Kroger, 1968). Such a distinction
has generally been made between self-esteem and dominance at the
conceptual level. Empirical demonstration of the differential
relationship of the measures of these constructs to other variables is
needed.

Table 1 indicates that the four methods of assessing dominance
were highly intercorrelated. The Dominance scale of the CPI appears
to be one of the better scales in that inventory. It has
consistently correlated well with peer ratings of dominance (Dicken,
1963; Gough, 1957) and has shown greater discriminant validity
than some other CPI scales (Dicken, 1963). In the present study it
was strongly correlated (.78) with the Dominance dimension of the
ICL and showed substantial relationships with self and peer
ratings of this attribute. The high correlations of ICL Dominance
scores with the other measures (.78, .52, .41) support recent
statements that the ICL may be a useful instrument and that further
investigation into its psychometric properties is warranted (Bentler,
1965; Wiggins, 1968). Simple self ratings of dominance showed
consistent and strong relationships with the other three methods
(.54, [value illegible], .50).

The five alternative methods of measuring self-esteem did not
yield consistently high intercorrelations. However, the relationships
among the CPI, Janis-Field, and self rating measures (.67, .58, .60)
indicate that these three clearly form a cluster and are tapping the
same attribute. Peer ratings of self-esteem were significantly
correlated with each of these measures (.24, .23, .33) but the amount of
variance they share with the highly-interrelated self-descriptive
methods indicates that these ratings clearly are not a part of this
cluster. The self-ideal distance measure was unrelated to peer ratings
(-.02) and was only modestly correlated with the other three
indices of self-esteem (.26, .20, .12). What we have, then, is a cluster
of three highly intercorrelated self-descriptive measures, this cluster
having partial overlap with both the peer ratings and the self-ideal
distance index, but these latter measures covarying with
independent portions of the common variance shared by the three
self-report measures.

The data of the present study do not permit a conclusive state- 
ment as to why the peer ratings did not correlate higher with the 
cluster of self-descriptive measures. One possibility is that the in- 
terrelationships among the three self-report methods were increased 
by common variance due to social desirability and other influences 
that enter into paper-and-pencil techniques. This interpretation 
would view peer ratings as representing a greater portion of true 
variance because of less susceptibility to artifactual influences. An 
alternative possibility has to do with the frame of reference from 
which the judgments are made. A person describing himself has ac- 
cess to a great deal of private experience (feelings, thoughts, etc.) 
not available to one who is judging another person. This interpreta- 
tion might consider the self-report methods as possessing greater 
true variance than the peer ratings, which might then be considered
as based primarily on a person's social stimulus value.

The Flexibility scale of the CPI correlated significantly with each 
of the other three measures of this attribute (.42, .34, .25); the only
other significant convergent validity coefficient was between self and
peer ratings of openmindedness. The Dogmatism scale was significantly
correlated with the Flexibility scale, to an extent similar to
that found in previous studies (Korn and Giddan, 1964; Rokeach,
McGovney, and Denny, 1960), but was unrelated to either self or
peer ratings. The peer rating measure was a combination of peer
nominations on two dimensions, “Openminded-Dogmatic” and
“Flexible-Rigid.” Although these dimensions are clearly related,
Rokeach et al. (1960) have argued for a conceptual distinction
between rigid and closed-minded thinking. Indeed, the correlation
between these two dimensions of peer nominations was .43, almost
exactly the same as that between the (inverted) Dogmatism and
Flexibility scales (.42). It might be argued, then, that one should
not expect a high correlation between Dogmatism and the combined
peer rating variable. However, the correlation between Dogmatism
scores and ratings on the single dimension of “Openminded-Dogmatic”
was only -.06. It should also be noted in this regard that
self ratings made on a “Closed- vs. Open-minded” dimension were
significantly correlated with the Flexibility scale (.34) but not with
the Dogmatism scale (.11). These findings question the ability of
the Dogmatism scale to assess this attribute independently of a
flexibility-rigidity construct.

It is interesting to note that none of the methods clearly outper- 
formed the measures obtained by simple self-ratings on the attri- 
butes of interest. Not only were the convergent validity coefficients 
for the self ratings comparable to those for the other methods, but 
also the intercorrelations among the traits as assessed by this method 
(heterotrait triangles involving the self ratings) reflected the same 
interrelationships which so consistently appeared in the other hetero- 
trait triangles. These findings lend further credence to Peterson’s 
(1965) suggestion that it may be possible to adequately obtain per- 
sonalistic information without employing the lengthy inventories 
traditionally used in such research. However, the conditions under 
which it is and is not appropriate to use this direct approach re- 
main to be determined. One problem that seemingly would be im- 
portant in such self ratings is social desirability. The extent to 
which these ratings are confounded with social desirability needs 
to be examined. Furthermore, the conditions and purposes of assessment
may interact with social desirability in influencing these
ratings. For example, when the assessment is for research purposes and
subjects are assured of anonymity, these self ratings may be quite 
useable. On the other hand, when the assessment has important con- 
sequences for the individual (as in personnel selection), social de- 
sirability may become the dominant influence, thereby decreasing 
the validity and usefulness of these measures. It is necessary that
these questions be examined before the encouraging results of both
Peterson (1965) and the present study can be fully evaluated.

THE STABILITY OF INDIVIDUAL DIFFERENCES IN
STRENGTH AND SENSITIVITY OF THE NERVOUS 
SYSTEM¹

RESEARCH on individual differences (IDs) by Russian psychologists
has differed in some major respects from the approach undertaken
by most Western investigators. The latter have typically been
concerned with such ID variables as intelligence, ability, and
personality. These variables have usually been measured through
more or less standard psychometric procedures such as paper-and-pencil
tests, self-report inventories and projective techniques.
Occasionally objective behavioral or apparatus measures are taken, or
physiological indices recorded. The paper-and-pencil paradigm
predominates, however, with scores on these tests usually being
correlated with scores on other tests, or with measures of learning,
perception, and so on. Where human learning is concerned, to consider
one specific area of research, distinctions have been made between
extrinsic and intrinsic IDs (Jensen, 1967). The former are considered
to be sources of ID variance external to the learning process,
e.g., anxiety, whereas the latter are seen as sources of ID variance
internal or intrinsic to the learning process, e.g., susceptibility to
interference in proactive and retroactive interference paradigms. In
both cases, IDs are considered to contribute to learning.

A major Russian approach to IDs is represented by the work of
Teplov (Gray, 1964), who has attempted to study the three Pavlovian
dimensions of "strength," "equilibrium" and "mobility" of
cortical excitation-inhibition in human Ss through the measurement
of sensory and intersensory phenomena such as absolute visual
thresholds (AVT) (held to reflect "sensitivity" of nervous
functioning, and considered to be the inverse of the "strength"
dimension), the effect on AVT of repeated peripheral visual stimulation,
the effect on AVT of auditory stimulation, the effect on absolute
auditory threshold (AAT) of visual stimulation, and so on.
Through the use of extensive batteries of such measures, with
subsequent factor analysis of these measures (Rozhdestvenskaya,
Nebylitsyn, Borisova and Yermolayeva-Tomina, 1960), it has been
possible to establish relatively unambiguous ID dimensions,
particularly where the "strength" dimension is concerned. Pavlov
considered "strength of nervous activity" or "strength of
nervous system" to be a fundamental ID dimension in man and
animals, referring to higher nervous activity and specifically
"strength of the excitatory process" and "strength of the inhibitory
process." Teplov, who has undertaken the most extensive application
of these notions to human behavior, has only considered,
where strength is concerned, the "strength of the excitatory process."
Although the Pavlovian hypotheses of neural activity held to underlie
these ID dimensions would find little subscription in the West, the
behavioral operations used in establishing the dimensions are
acceptable procedures that can be judged on their own basis. The
[Russians] have [been] satisfied with reporting the dimension[ality of]
sensory phenomena, and relating this, often through experimental
manipulations, to putative processes of central excitation
and inhibition. [They have] not attempted to delineate the
repre[sentation of these dimensions in] learning, perception and
motivation. The [present authors have] been engaged in the development
of a multi-dimensional approach to IDs in human learning which
includes among its inputs a basic sensory dimensional analysis (SDA)
[similar] to that of Teplov, with, however, some ante[cedents in the
(neg]lected) work of Galton. The next step beyond the factor analytic
identification of [...] taken, which is the study of the contributions
of these dimensions to [...]. However, in attempting to integrate the
Teplovian dimensions into our SDA work in human learning, it
became apparent that the stability of these dimensions required
estimation. The "strength" dimension has been well established by the
Russians (Gray, 1964) and accordingly, it was first considered as
a potential contributor to an ID analysis of human learning. Teplov
(1964) has reported that "sensitivity" of the nervous system,
considered to be the inverse of strength, is well represented by AVT,
which in the factor analysis by Rozhdestvenskaya et al. (1960) was
found to have the highest loading of all measures on a strength
dimension. Another measure of strength, the so-called Induction
Method, had loadings on the strength dimension of from 0.52 to
0.74. This method involves, in all of its variants included in the
Rozhdestvenskaya et al. (1960) study, the influence on the threshold
taken to a point of light in peripheral vision of the presentation
of an additional strong or weak light in the visual field. The effect
here is that the threshold is raised by addition of a weak light and
lowered by the addition of a strong light. One of the variants of the
Induction Method has been called the "shape of the curve" measure
(Rozhdestvenskaya, 1955), which relates, for each S, the sensitivity
to the principal or main stimulus to intensity of the additional
stimulus, with the latter usually being varied over a wide range. An
abbreviated variation to this method introduced by Rozhdestvenskaya
(1959) employed but one additional stimulus (100 x threshold)
and determined the effect of this additional stimulus on
sensitivity to the main stimulus. This is here called the "modified shape
of the curve" index.
The present study was designed to obtain estimates of the stability
of representative strength and sensitivity measures over a period
of one month. The measures chosen were Rozhdestvenskaya's modified
shape of the curve index and the AVT.


Method 
Subjects 
Fifteen university graduate students served as Ss (mean age = 26
years, range 22-30).
Apparatus 


Thresholds were obtained with an NDRC model III adaptometer.²
The shape of the curve index involved the use of an additional
light source peripheral to the main stimulus of the adaptometer.
The adaptometer included a red fixation point (cross),
directly below which at an angular distance of 2° 17′ was the main
stimulus for visual threshold (dia. = 36 in.), below which at an
angular distance of 45′ was the additional (peripheral) light source
(dia. = %4 in.). The angular height of all three stimuli was Y.
These parameters were based on the report by Rozhdestvenskaya
(1955). The adaptometer was powered by a six volt D. C. current
with the stimulus light being controlled by a variable wedge neutral
density filter. Color was controlled with a Kodak 540 mμ filter.


Procedure 


Ss were dark adapted while wearing red lucite goggles in a semi-dark
(15 watts illumination) 6 ft. x 6 ft. sound-reduced room for
30 minutes. Following this, each S was moved to a similar testing
room (6 ft. x 6 ft.) where, while seated in the threshold testing
apparatus in total darkness, he was dark adapted for a further 10 min.
For actual testing, the S's chin rested in a Bausch and Lomb Model
BA5372 chin rest, such that his eyes were 24 in. from the adaptometer
face directly in line with the center of the main light source.
Binocular viewing was used. The test patch was 16 in. in diameter
and located 20° angular distance below the fixation point, with the
angular size of the aperture being 1° 30′. The S was located in a
totally darkened cubicle within the experimental room itself, while
the E was located outside this cubicle. The interior of the cubicle,
including apparatus, was entirely matt black in finish.

The S was instructed to fixate upon the red fixation cross. He 
was told that the E would deliver an auditory cue (pure tone)
following which the main visual stimulus directly below the fixation
point would be presented for 1 sec. He was to respond with a simple
yes or no as to whether or not the light was perceived. A
method of limits was employed, with a criterion of two consecutive
yes-responses on the descending series and two consecutive
no-responses on the ascending series. The mean of the two series was
taken as the AVT.


2 The authors would like to thank Prof. F. A. Mote of the University of 
Wisconsin for the long-term loan of this apparatus.

The second phase of the study obtained the modified shape of the
curve index. This consisted of introducing the additional, peripheral
visual stimulus at the light source located directly below the
test patch, at a light value 110 X the threshold for all Ss.
The threshold to the main stimulus was then taken once in the
presence of the additional light source. The identical procedure to
AVT was used to obtain the threshold. The S fixated on the red
fixation point throughout.

Both the AVT and modified shape of the curve data were expressed
in log luminance units derived from the neutral density
wedge at the main light source. Periodic checks with a photomultiplier
indicated the luminance to be constant at given wedge [settings].

The experimental procedure took approximately 15 minutes following
dark adaptation. The Ss were tested between 9 a.m. and 2
p.m., with no Ss being tested during lunch hour. The Ss were asked
not to drink coffee or other high caffeine drinks on the day of
testing [...] Russian work on strength as a variable held to influence
'cortical lability.' All Ss tested had normal visual acuity with no
history of eye problems.

The retest session (Session 2) for the stability estimate was undertaken
exactly one month after the first test (Session 1), at the
same time of day, with the same E and under identical conditions.
The same procedures were used in Session 2 as were used in Session
1, including the dark adaptation period.


[...] significant (p < .001), demonstrating that the mean [AVT had]
significantly decreased from the first to second testing. [A scatterplot]
of the two sets of scores indicated no [...]
approximately 83 per cent of the variance [in AVT on Session]
Two was accountable in terms of Session One performance.

Turning to the modified shape of the curve index, the mean threshold
for Session One was -1.16 log luminance units, whereas the
mean on Session Two was -.38 log luminance units. This difference
was significant by t test (p < .02) indicating a significant decrease
in the mean modified shape of the curve value from first to second
testing. A scatterplot of the two sets of scores indicated no
curvilinearity of relationship, but rather a positive linear relationship. The
product-moment correlation was .61 (p < .02) demonstrating that
approximately 37 per cent of the variance in the modified shape of
the curve measure on Session Two was accountable in terms of
Session One performance.
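
The step from a test-retest correlation to "per cent of variance accountable" is the squared correlation, and the significance of r can be checked with the usual t transformation. The sketch below is illustrative only: the function names are hypothetical, n = 15 is taken from the Subjects section, and the t formula is the standard one for a product-moment correlation, not something quoted from the paper.

```python
import math


def variance_explained(r):
    """Proportion of variance in one session accounted for by the other."""
    return r * r


def t_for_r(r, n):
    """t statistic for testing a product-moment r against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r_shape = 0.61  # reported stability of the modified shape of the curve index
n_subjects = 15

pct_shape = 100 * variance_explained(r_shape)  # about 37 per cent
t_shape = t_for_r(r_shape, n_subjects)         # compare with t tables, df = 13

# Working backwards, 83 per cent shared variance implies r of about .91.
r_avt_implied = math.sqrt(0.83)

print(round(pct_shape), round(t_shape, 2), round(r_avt_implied, 2))
```

With only 15 Ss, both stability estimates carry wide sampling error, so the squared correlations are best read as rough magnitudes.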

The major finding of the study is the marked stability of IDs in 
AVT, and the moderate stability of IDs in the modified shape of the 
curve index. The AVT stability estimate was as high or higher
than is usually found with intellective or personality measures, 
and clearly meets the primary requirement for use in ID studies. On 
the basis of the Rozhdestvenskaya, et al. (1960) factor analysis 
where the highest loading on a strength factor was AVT, it might be
suggested that taking into account the present results, this measure 
should be strongly recommended for use as an index of strength. It 
possesses high factorial validity and exceptionally high reliability. 
In Teplovian theory it is a measure of sensitivity and is thus the 
inverse of strength. This, of course, does not change the fact that 
with its high loading on the strength dimension it can be used in 
practice as a measure of strength. The high loading of AVT on the 
strength factor may, of course, be in part due to its very high reli- 
ability. The lower loadings of many of the other putative measures 
of strength in the Rozhdestvenskaya, et al. (1960) factor analysis 
may have been due to lower reliabilities, as found with the present 
modified shape of the curve index. The latter measure as well as 
many remaining strength measures are considerably more compli- 
cated procedurally than the simple AVT. The introduction of ad- 
ditional sources of stimulation into the threshold testing situation, 
as well as the longer period of time required to complete the task, 
with accompanying postural fatigue or boredom, might be expected
to lead to greater intra-individual response variability with
a possible consequence in attenuated reliability estimates. The mod- 
ified shape of the curve reliability estimate of .61, however, is not 
discouraging enough to forsake this measure. Indeed, the safest
approach to the measurement of as nebulous a concept as strength
is through multiple indices, among which the AVT and modified
shape of the curve should be numbered.


Summary 

The stability over one month of representative measures of 
strength (“modified shape of the curve”) and sensitivity (absolute
visual threshold, AVT) of the nervous system as determined from
the research and theory of Teplov and associates at the University 
of Moscow was estimated on 15 Ss. The stability estimate for the 
“modified shape of the curve” measure was .61 (p < .02) while that 
for the AVT was .91 (p < .001). The results were discussed in regard
to the Russian factor analytic identification of a dimension of
strength, and in relation to the choice of strength measures for
further research. A multiple-indices approach was recommended.
A REVISED PROCEDURE FOR THE ANALYSIS OF
BIOGRAPHICAL INFORMATION


WILLIAM H. CLARK AND BRUCE L. MARGOLIS
Case Western Reserve University 


WHEN examining personal history or biographical information
blanks (BIBs) as potential predictors in different research projects, 
the authors noted a slight discrepancy in the item analysis pro- 
cedures described by Stead and Shartle (1940) and England (1961). 

Briefly, these procedures include calculation of response frequen- 
cies and differences in proportions between criterion groups and 
derivation of “net weights” for response categories from the differ- 
ences in proportions according to Strong’s tables. The net weights 
derived from Strong’s tables range from —98 to +28. Since this 
range of net weights may appear to imply greater predictive ability 
than is in fact true, net weights are converted to “assigned weights” 
reflecting a simple positive, negative, or absence of relationship
with the criterion. Only those items achieving a specified net weight,
either positive or negative, are given assigned weights and included 
in the final scoring system for the BIB. 

It is at this point that the discrepancy between the two procedures
appears. To differentiate between the criterion groups, and
thus to be included in the final BIB scoring system, any response 
must have a net weight whose absolute value is two or greater, ac- 
cording to Stead and Shartle (1940, p. 255). England, on the other 
hand, requires an absolute value of four or greater for acceptance 
of a response as differentiating between the groups. England's only
comment regarding this more stringent requirement is given in a brief
footnote: the requirement “is modified from that suggested by Strong so
as to weight fewer chance differences between the weighting groups”
(England, 1961, p. 25). Not only is this difference in requirements
unexplained, but also, neither procedure explains the statistical basis
for consideration of any net weight as discriminating.
Using these methods in an item analysis of BIBs developed in
two independent selection research projects, the authors also found
themselves unable to determine the probability of Type I error. To
determine the reasons for the discrepancy between Stead and
Shartle’s and England's approaches and also to search for any
established method of determining Type I error probabilities, the
authors reviewed Strong’s original development of the item analysis
procedure (Strong, 1926).
Strong utilized Kelly's formula (Cowdery, 1926) to obtain b
values determined from phi coefficients. In using b values Strong
was setting up a multiple regression equation between the item
predictors and the criterion. His rationale, however, for the jump from
net weights to assigned weights is unclear. The assigned weights, it
would seem, should be based upon the probability that the response
is discriminating. The probability that a given phi coefficient is not
a chance deviation from zero can be determined, but a more direct
approach is available to determine whether or not an item is
discriminating. For example, a simple t test can permit the
determination of the probability associated with differential response
frequencies for different criterion groups (such as males and females
or high and low performers).


where:
$p_1$ = proportion of one criterion group choosing the given response.
$p_2$ = proportion of other criterion group choosing the given response.
$q_1 = 1 - p_1$; $q_2 = 1 - p_2$.
$n_1$ = size of first group responding to entire question.
$n_2$ = size of the second group.

(Burtt, 1942, p. 329).
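
The formula to which these definitions belong did not survive in this copy. As a hedged illustration only, the sketch below assumes the conventional critical ratio for a difference between two independent proportions, $t = (p_1 - p_2)/\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}$, which is consistent with the symbols listed, and it uses a normal approximation for the two-sided probability. The function names and the example figures are hypothetical.

```python
import math


def proportion_difference_ratio(p1, n1, p2, n2):
    """Critical ratio for the difference between two independent proportions.

    Assumes the unpooled standard error sqrt(p1*q1/n1 + p2*q2/n2).
    """
    q1 = 1.0 - p1
    q2 = 1.0 - p2
    standard_error = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    return (p1 - p2) / standard_error


def two_sided_probability(ratio):
    """Two-sided tail probability under a normal approximation."""
    return math.erfc(abs(ratio) / math.sqrt(2.0))


# Hypothetical response: chosen by 60% of 100 high performers
# and by 40% of 100 low performers.
ratio = proportion_difference_ratio(0.60, 100, 0.40, 100)
print(round(ratio, 2), round(two_sided_probability(ratio), 4))
```

Computing the probability for each response directly, rather than looking up a net weight, leaves the cutoff for including a response to the researcher's judgment.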
to responses which do not discriminate between the groups. Correlation
methods may also be used to obtain similar results.
Strong appears to have developed his method to permit research-


proportions of item responses given by the criterion groups. With
the availability of high speed data processing, however, it becomes
a very simple process to determine the precise probability of Type
I errors. The authors felt the determination of t values and the
probability associated with them for each response would permit a
more clear and meaningful analysis of BIB responses.


ties allows the researcher to use his own criteria and judgment for
inclusion or exclusion of items in the final weighting system.
Empirical tests of the comparability of Strong's method and the
calculation of t values showed a high degree of similarity between
the two approaches. Using one set of data, the net weights derived
for Strong’s method were correlated with t values calculated from


was obtained. Using a different set of data, a Pearson product-moment
correlation coefficient of .85 was obtained between net
weights found in Strong's tables and actual probabilities of Type I
errors determined from t values.
With the advent of high-speed processing methods, the somewhat 


exact probabilities, thus allowing the researcher to
exercise his judgment.


Burtt, H. E. Principles of employment psychology (Revised ed.).
PROVISION FOR PUBLICATION OF VALIDITY
STUDIES OF ACADEMIC ACHIEVEMENT 


Early in the life of this journal it became evident that the prediction
of academic achievement is by far the most popular area of research
in the measurement field. It also became apparent that unless
measures were taken, the journal might easily be practically
monopolized by this subject. The heroic measure resorted to for a
while was simply not to publish any studies on the prediction of
academic achievement.
In the course of time, it became evident that the solution hit
upon was too drastic. After all, it is important that validity reports
be available, at least in condensed form, to educational and personnel
psychologists and to school counselors who wish to evaluate
the relative merits of the various instruments available for the
prediction of academic achievement. Furthermore, it appears that a
substantial amount of validity data cannot be conveniently
communicated to professional workers in the field of measurement
unless provision is made for publication in a professional journal.
In the light of this situation, the policy has been adopted 
“lishing а section devoted to such studies in the form of extra 
‘for which the authors bear most of the publication costs, This 
"Allows publication of the usual number of pages on 
the measurement field. The charges consist of 
"page of running text plus any extra costs which may be 
the composition of tables, figures, and formulas. 
tished one hundred off-prints without extra charge. 
Preference will be shown for manuscripts of fewer than 
“Fords, with no more than six references and containing 
‘tables each of no more than 8%” X 11” elite 
making six printed pages. Any manuseript exceeding 
Teferences, and four tables or figures equivalent to 
typed pages will be automatically „аз 12 
be the maximum total number of pages for any 
in this section. 
te Validity Studies of Academic енде 
ed twice a year, once in the Summer е 
d for which the closing dates for receiving manuscripts 
ember 30th and May , . ٤ 
Two copies of the manuscripts should be sent to:


Dr. William Dee
325 Callita Place
San Marino, California 91108.
AN EMPIRICAL VALIDITY STUDY OF THE
ASSUMPTIONS UNDERLYING THE STRUCTURE OF
COGNITIVE PROCESSES USING GUTTMAN-LINGOES
SMALLEST SPACE ANALYSIS


H. W. STOKER AND R. P. KROPP
Florida State University


THE authors of the Taxonomy of Educational Objectives: Cognitive
Domain (Bloom, 1956) regarded the structure of the cognitive
processes to be cumulatively hierarchical. That is, the major classes
(knowledge, comprehension, application, analysis, synthesis, and
evaluation) could be placed on a continuum such that each successive
class includes all behaviors represented by the class or classes
preceding it. The authors also suggest that these classes of abilities,
behaviors, or processes transcend content. Previous studies (Kropp and
Stoker, 1966; Smith, 1968) investigated the construct validity of
these assumptions of cumulative hierarchy and transcendence of
process separately. In this study the Guttman-Lingoes smallest space
analysis (SSA-1) technique is used in an attempt to examine and
validate both assumptions simultaneously.
Guttman (1954) discussed the concepts of the simplex, circumplex,
and the radex. If a perfect simplex occurs, certain partial correlations
will vanish. If

$$r_{ik} = r_{ij} r_{jk} \quad \text{where } i < j < k,$$

then $r_{ik \cdot j} = 0$ for all $i < j < k$. When this relationship is satisfied,
the resultant correlation matrix will have the largest values along the
upper-left, lower-right diagonal followed by the next larger values in
the adjacent diagonal and the smallest values in the upper-right and
lower-left corners of the matrix. The relationship is such that the
column sums of the correlation matrix for a perfect simplex will be
smallest at the extremes and largest for the central columns. Assuming
that the correlations arose from a set of tests, the existence of a
simplex would imply an ordering of the tests along some dimension.
The assumption of cumulative hierarchy in the taxonomy implies
that a set of tests designed to measure the mental processes should
exhibit the simplicial structure along a complexity dimension.
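
The two simplex properties described here (vanishing partial correlations, and column sums that are largest for the central columns) can be checked numerically. The sketch below builds a hypothetical perfect simplex from assumed correlations between adjacent tests; the values .8, .7, .7, .8 and the function names are illustrative only and are not taken from the study.

```python
import math


def simplex_matrix(adjacent):
    """Perfect-simplex correlations: r[i][k] is the product of the
    adjacent-test correlations lying between test i and test k."""
    n = len(adjacent) + 1
    r = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            product = 1.0
            for j in range(i, k):
                product *= adjacent[j]
            r[i][k] = r[k][i] = product
    return r


def partial_correlation(r, i, k, j):
    """First-order partial correlation of tests i and k with test j held constant."""
    numerator = r[i][k] - r[i][j] * r[j][k]
    denominator = math.sqrt((1 - r[i][j] ** 2) * (1 - r[j][k] ** 2))
    return numerator / denominator


r = simplex_matrix([0.8, 0.7, 0.7, 0.8])  # hypothetical adjacent correlations
column_sums = [sum(column) for column in zip(*r)]
print([round(s, 3) for s in column_sums])
print(partial_correlation(r, 0, 4, 2))
```

With five ordered tests, the printed column sums rise toward the middle test and fall again, and the partial correlation between the two extreme tests with the middle test held constant is zero up to rounding error.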
If a uniform, perfect, additive circumplex exists, then

$$
t_j =
\begin{cases}
c_j + c_{j+1} + \cdots + c_{j+m-1} & (j \le n - m + 1) \\
c_j + c_{j+1} + \cdots + c_n + (c_1 + \cdots + c_{m+j-n-1}) & (j > n - m + 1)
\end{cases}
$$

where $t_j$ represents a test with elementary additive components
($c$). Assuming that all elementary components are uncorrelated
($r_{c_p c_q} = 0$ for $p \ne q$) and considering the special case where the
components have equal variance and where $m > n/2$, then:


$$
r_{jk} =
\begin{cases}
1 - \dfrac{k - j}{m} & (0 \le k - j \le n - m) \\
1 - \dfrac{n - k + j}{m} & (m \le k - j < n)
\end{cases}
$$


The characteristics of a matrix of intercorrelations which satisfy 
these restrictions are that the column totals will be equal and each row 
of the table will have the same entries as the preceding row, but 
moved one column to the right, the end one moving to the beginning. 
A matrix exhibiting these characteristics was referred to as a
“circulant” by Guttman (1954). If process transcends content, as
suggested by the authors of the Taxonomy, then tests at the same
process level should have higher correlations than tests of different
process levels, implying that one should find a circumplex when dealing
with taxonomy-type tests.
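
The circulant pattern can be illustrated with a small numerical sketch. It assumes the model stated above: each of n tests is the sum of m consecutive components taken around a circle of n uncorrelated, equal-variance components, so that the correlation between two tests equals the number of components they share divided by m. The choice n = 6, m = 4 and the function names are hypothetical.

```python
def circumplex_matrix(n, m):
    """Correlations among n tests, each the sum of m consecutive components
    on a circle of n uncorrelated, equal-variance components."""
    tests = [{(j + step) % n for step in range(m)} for j in range(n)]
    return [[len(a & b) / m for b in tests] for a in tests]


def is_circulant(r):
    """True if every row is the preceding row moved one column to the right,
    with the end entry wrapping around to the beginning."""
    return all(r[i] == r[i - 1][-1:] + r[i - 1][:-1] for i in range(1, len(r)))


r = circumplex_matrix(6, 4)  # m > n/2, as in the special case considered
for row in r:
    print(row)
print(is_circulant(r), [sum(column) for column in zip(*r)])
```

Unlike the simplex, no test sits at an extreme: every column has the same total, so the tests have a circular rather than a linear order.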

When the correlations do not meet the rigorous requirements for
either a simplex or a circumplex but the matrix exhibits the
characteristics of one or the other, the matrix may be classed as a
quasi-simplex or a quasi-circumplex.

In a radex, both types of ordering occur simultaneously and two
dimensions should suffice for representing the relationships. In this
study, the relationships between tests of differing process levels
should decrease as the distance between them on the complexity
dimension increases; tests of the same process level should have higher
intercorrelations than tests of different process levels. These
relationships will be reflected by the distance between the tests when
plotted in two-space. [For additional background, the reader is referred
to two other contributions by Guttman (1964, 1968).]

Procedure. The data analyzed were from a study conducted by the
authors, Kropp and Stoker (1966). Taxonomy-type tests, designed to
measure the six levels of mental processes as defined in the Taxonomy,
were constructed for each of four contents. Tests were based on
reading passages presented to the students and available to them
while responding to test items for all but the synthesis and evaluation
levels. Reading passages were selected on the basis of ease of
comprehension, interest value, and unfamiliarity of the material to the
students. The tests, which were entitled “Atomic Structure,” “Glaciers,”
“Lisbon Earthquake,” and “Stages of Economic Growth,” are the
same ones used in the Smith studies (1968).

Tests of knowledge, comprehension, application, and analysis each
contained 20 multiple-choice questions. Each synthesis test consisted
of five free-response questions, scored zero to four; each evaluation
test consisted of 10 free-response items, scored zero to two. Thus, a
maximum of 20 points could be earned on each test. Tests were
administered to students in grades nine through 12. Correlation
coefficients were derived for each grade level and for all grades combined.
The number of scores contributing to each coefficient varied, but
within grade, the number exceeds 750 in each case.
The five 24 x 24 intercorrelation matrices formed by intercorrelating
scores within and over grades were subjected to an analysis by
the Guttman-Lingoes SSA-1 computer program. The solution described
below is the one obtained by hypothesizing that two dimensions
would represent the data. Other solutions were tried, but the use
of two dimensions appeared to yield the best solution.

If the assumptions of cumulative hierarchy and transcendence are
tenable, the graphs produced by the well known Guttman-Lingoes
SSA-1 program should provide evidence of a radex. One dimension on
the graph should represent complexity of process and the other the
generality of process over content. Hence, the combination of the
specially constructed tests and the analysis technique provides a
means by which the construct validity of the underlying assumptions
can be examined.

Results. Graphs representing the data for grades nine through 12
and all grades combined were constructed. Figure 1 contains the
solution for all grades combined. The solutions for grades nine through
12 yielded different graphs, but with some similarities. The graph for
all grades combined is provided for illustration of the results of such
a solution. The solution represented by the figure is the one resulting
from the specification of two dimensions. Coefficients of alienation,
which refer to the spread in the scattergram yielded by the program,
were approximately .20. These coefficients indicate the degree to which
$r_{ij} = f(d_{ij})$, where $f$ is a monotonically decreasing function. A zero
value would represent a perfect fit.

K = Knowledge
C = Comprehension
Ap = Application
A = Analysis
S = Synthesis
E = Evaluation

Figure 1. GL-SSA-1 solution for two dimensions, all grades combined.

In all five graphs there was some evidence of the existence of a
radex. The “center” for grades 9, 10, 12 and all grades was the set of
tests measuring comprehension, with application, analysis, synthesis,
and evaluation radiating out in order from this center. For grade 11,
the “center” was a mixture of comprehension and application.

The notable exception to the hypothesized outcome, in each figure,
was the set of tests designed to measure knowledge. The grouping of
knowledge tests for each grade reflects the high intercorrelation of
the knowledge tests coupled with a somewhat random distribution of
correlations of knowledge tests with other tests. The fact that these
tests were designed to maximize the number of individuals receiving
a perfect score probably accounts for the pattern of correlations
obtained and the separation in two-space of these tests from the others.

A somewhat similar patterning appeared in the graphs for grades
11, 12 and for all grades, with respect to the evaluation tests. These
tests proved to be extremely difficult, yielding markedly skewed
distributions. While there was some correlation among the evaluation
tests (average r = .50), there was no definite pattern in the
correlation of the evaluation tests with tests at other levels. The lack of a
pattern, coupled with the intercorrelation among the evaluation tests,
probably explains the grouping pictured in these graphs.
VALIDITY OF TAXONOMIC TESTS


I. LEON SMITH 
University of Cincinnati 


IN order to validate tests derived from Bloom's (1956) Taxonomy
of Educational Objectives, it is necessary to base items on content for
which students have relatively the same mastery so that score
variability will reflect differential mastery of the cognitive processes. The
latter are defined as Knowledge, Comprehension, Application, Analysis,
Synthesis, and Evaluation.
In a recent attempt (Kropp, Stoker, and Bashaw, 1966), the 
method for controlling content mastery involved the introduction of 
material, at the time of testing, which was assumed to be equally un- 
familiar to all students. This was accomplished by presenting the 
students with a reading passage and basing the taxonomic items on 
its content. This study reports on the use of this type of response
measure in an effort to hold content knowledge constant.
Method. The Kropp et al. (1966) test of the Stages of Economic
Development (SED) and the Social Studies (SS) subtest of the
Stanford Achievement Test: High School Battery were administered
to 141 eleventh-grade students. The IQ range of the group was 67 to
..., as measured by the Lorge-Thorndike Intelligence Test, Level G,
Form 1.
Results and discussion. Support for the use of unfamiliar material
as a method for controlling content knowledge would be indicated if
the relationships between the SED tests and the measure of subject-
matter mastery (SS) were low and insignificant. However, in view
of the substantial correlations between the SED tests and the SS
measure as shown in Table 1, there is the strong suggestion that
performance on each taxonomic test depends heavily on relevant past
TABLE 1

Reliabilities and Correlations among Tests

Test             SS      Reliability
Knowledge        .73*    St
Comprehension    .73*    .81
Application      .04*    .86
Analysis         .71*    .74
Synthesis        .72*    .72
Evaluation       .60*    .71
SS                       .88

achievement. It appears that content knowledge is not controlled by 
the presentation of unfamiliar material. Thus, the interpretation of 
score variability in terms of differential mastery of the cognitive 
processes may not be warranted. Since the use of any response measure
is likely to permit relevant past knowledge and experience to be
transferred to the testing situation, the validity of the process levels of 
the Taxonomy may be difficult to establish. However, the results 
are encouraging in the sense that the measure of standardized 
achievement does seem to tap all of the behaviors postulated by the 
Taxonomy as measured by the SED test. 


REFERENCES 


Bloom, B. S. (Ed.) Taxonomy of educational objectives. New York: 
Longmans, Green, 1956. 
Kropp, R. P., Stoker, H. W., and Bashaw, W. L. The construction
and validation of tests of the cognitive processes as described
in the taxonomy of educational objectives. Cooperative Research
Project #2117, U. S. Office of Education, Institute of
Human Learning, and Department of Educational Research and
Testing, Florida State University, 1966.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1971, 31, 477-484.


MEASUREMENT OF COLLEGE ACHIEVEMENT BY THE 
COLLEGE-LEVEL EXAMINATION PROGRAM 


AMIEL T. SHARON 
Educational Testing Service 


THE General Examinations (GEs) of the College-Level Examination
Program (CLEP)¹ are intended to provide a comprehensive
measure of undergraduate achievement in five basic areas of liberal
arts: English, natural sciences, humanities, mathematics, and social
sciences-history. The tests are not designed to measure advanced
training in any specific discipline but rather to assess a student's
knowledge and comprehension of basic facts, concepts, and principles
in each of the five subjects. The content covered by the GEs is similar
to the content included in the program of study required of many
liberal arts students in the first two years of college. It has been
developed by committees of specialists in each of the subject-matter
fields. The committees work with test specialists in defining the topics
to be covered, reviewing the test specifications, and suggesting and
reviewing test questions.

In addition to being used for granting college credit or placement for
military service experiences, television and correspondence courses,
and independent study, the GEs are used for a variety of other purposes
at collegiate institutions. They are employed for guiding students
into appropriate curricula of study; admitting and placing
transfer students; assessing student growth in various curricula; and
selecting students for upper division studies. Many colleges and
universities are also using the examinations for self-study and for
research on specific questions about types of students, courses, or
curricula. The questions which are asked range from “How do our
sophomores compare with those at other colleges in terms of their
liberal arts education?” to “Does exposure to our liberal arts courses
result in greater knowledge as measured by these tests?”

¹ The CLEP, which is sponsored by the College Entrance Examination
Board, includes both the General and Subject Examinations.

The most common procedure for demonstrating the appropriateness
or validity of achievement tests, such as the GEs, is by means of
content validation. The test content is developed systematically to be
representative of the subject matter to be measured. In addition,
empirical procedures such as item analysis aid the test specialists in
deciding on which items to include in the examinations. Since the GEs
have been constructed by rigorous procedures of content validation
described elsewhere (ETS, 1965), the present report focuses on the
empirical validity of the tests.

Two different types of empirical validity will be discussed:
criterion-related validity and construct validity. Criterion-related validity
is useful for prediction of future performance and assessment of
current achievement level. The criterion-related validity of the GEs will
be described in terms of the relationship of the tests to college grades.
Although the grade-point average (GPA) criterion has been criticized
for being unstable and for failing to reflect certain desirable
types of student traits such as ethicality, openmindedness, altruism,
maturity, and self-insight, its ready availability has promoted its
use as a criterion of college success by many researchers.

Unlike criterion-related validity, construct validity aims to increase
understanding of the educational or psychological attributes
measured by a test. It requires the gathering of information from a
variety of sources. The construct validity of the GEs will be described
by the inferred effect of college instruction on test performance
and by the differential performance of various types of students
on the examinations. The possibility of the examinations
being inappropriate to certain types of students, a topic closely
related to validity, will also be discussed.

Criterion-related validity. Positive correlations between the GEs
and overall GPA, in most cases overall sophomore GPA, have been
reported in studies conducted at six universities. Since GPA and
scores on GEs were collected simultaneously in these studies, these
correlations represent the concurrent validity of the examinations.
Invariably the English Composition Test was found to be the most
valid one, with a median coefficient of .46. The rank order of the
validity coefficients of the four other examinations was not consistent
across the different studies. Median validities were Natural Sciences
.40, Humanities .40, Social Sciences-History .36, and Mathematics
These correlations indicate that there is a moderately positive, but
far from perfect, relationship between the tests’ scores and
grades. This result is not too surprising, since grades in many courses
are based on objective tests similar in content and format to the
GEs. Nevertheless, these results suggest that the tests can be used
legitimately for granting course credit or placement in college.

The correlations between the GEs and grades in subjects corresponding
to each test are in general no higher than the test's correlations
with overall GPA. This conclusion is based on studies conducted
at two universities. A probable explanation of these results is that
overall GPA is more reliable than subject GPA because it is based on
a larger number of courses.
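
The explanation offered here, that an average over more courses is more reliable, can be illustrated with the Spearman-Brown formula. This is a sketch under the assumption that course grades behave like parallel measures; the single-course reliability of .30 and the course counts are hypothetical and are not values from the studies cited.

```python
def spearman_brown(r_single, k):
    """Reliability of a composite of k parallel components,
    each with reliability r_single (Spearman-Brown formula)."""
    return k * r_single / (1 + (k - 1) * r_single)


r_course = 0.30  # hypothetical reliability of a single course grade
for label, k in [("subject GPA, 4 courses", 4), ("overall GPA, 16 courses", 16)]:
    print(f"{label}: reliability = {spearman_brown(r_course, k):.2f}")
```

Because an unreliable criterion limits the correlation any test can show with it, a less reliable subject GPA could yield validity coefficients no higher than those for overall GPA even when the test content matches the subject more closely.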

The validity of the GEs when taken at the end of the sophomore
year, for predicting junior or junior/senior grades, is significantly
lower than the concurrent validity of the tests. Median validity
coefficients computed on the basis of three studies were English
Composition .36, Humanities .28, Natural Sciences .27, Social Sciences-
History .26, and Mathematics .15. Again, the English Composition
and the Mathematics Tests appear to be the most and least valid
tests respectively. The reason for the low validity of the Mathematics
Test may be that mathematics plays a very minor role in
courses taught in the last two years of college. The finding that the
predictive validities of the GEs are lower than their concurrent
validities indicates that the tests are less useful for guidance or
prediction of success in upper-level studies than they are as measures
of current achievement level.

Construct validity. Construct validity indicates the extent to which
a test can be said to measure a trait or a theoretical construct. It also
refers to the ability of a test to yield reasonable results, consistent
with expectations. For example, a scholastic achievement test should
yield higher scores for those who have more education than for those
who have less education; history majors should score higher on a
history test than biology majors; and students should have higher
scores on an algebra test after taking an algebra course than before
taking the course.
There are two reasonable expectations or implicit assumptions
underlying the College-Level Examination Program which have
implications for the construct validity of the GEs:

1. There is a gain in knowledge resulting from college instruction
which can be measured by an examination.

2. The examinations employed to measure gain in knowledge are
appropriate to the courses taught at the colleges.

These assumptions have implications which extend beyond those
underlying the coefficient of correlation. In demonstrating that there
is a positive correlation between test scores and grades no claim can
be made that test scores or grades are affected by instruction. In
order to determine whether a change in test performance is influenced
by college instruction, it is necessary to administer the test before
and after the course of instruction. Also required would be the testing
of one or more control groups (to which students would be randomly
assigned) who would not receive instruction appropriate to
the test or any instruction at all. Without a control group, any gains
achieved on the examinations could be interpreted as resulting from
intellectual growth rather than from a specific course of study.
Unfortunately, it is difficult to have control groups in educational
research. The notion of “manipulating” the learning of students for
the sake of research is anathema to many educators. None of the
studies which employed a “before-after” design to study score gains
on the GEs employed a control group.

Harris and Booth (1969) reported on gains made on the GEs from
the first to the sixth quarter by a group of 177 students who had taken
the test twice. The mean gains ranged from a high of .6 of a standard
deviation for the Social Sciences-History Test to a low of .3 of a
standard deviation for the Mathematics Test. In relating the gains
made on the GEs to grades in the courses corresponding to each test,
different results were found for the five tests. Students with high
grades achieved greater gains on the Humanities, Natural Sciences,
and Social Sciences-History Tests only. The authors conclude that
“on the average the better students in the various courses come into
those courses with better scores on the respective tests and show
greater gains” (p. 5). French (1965) described mean gains on the five
examinations for a group of 81 students. These gains are similar in
pattern and magnitude to those reported by Harris and Booth. Ko...
(1969) related gains to relevant course experiences for a sample of
...2 students tested twice. Significant gains were made by the students
only on the English Composition and Natural Sciences Tests.
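
Gains "of a standard deviation" of the kind reported here are mean gains divided by a standard deviation of the scores. The sketch below shows one way to compute such a figure; it assumes the first-testing standard deviation is used as the unit, which the studies cited do not specify, and the scores are invented for illustration.

```python
import statistics


def standardized_gain(first_scores, second_scores):
    """Mean gain from first to second testing, in units of the
    first-testing standard deviation (an assumed choice of unit)."""
    mean_gain = statistics.mean(second_scores) - statistics.mean(first_scores)
    return mean_gain / statistics.stdev(first_scores)


first_testing = [480, 500, 520]   # hypothetical scaled scores
second_testing = [510, 530, 550]  # the same students, retested
print(standardized_gain(first_testing, second_testing))
```

Expressing gains this way allows tests with different score scales to be compared, but without a control group it cannot separate the effect of instruction from maturation.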
The score gains reported in the three foregoing studies do not
necessarily indicate that a particular college has done a good job or a
poor job. The GEs are designed to cover subject matter content as
taught at different colleges with different curricula, methods, and
materials. They do not necessarily reflect all the objectives and
emphases of any one college. In addition, the lack of control groups
makes it difficult to know whether the score gains were a result of
instruction or simply a result of maturation or intellectual growth
occurring within the first two years of college.
The relationship of the GEs’ scores to amount of previous instruc-
tion in a subject generally provides support for the validity of the
examinations as measures of academic achievement. A relationship,
however, does not prove cause, and thus it cannot conclusively dem-
onstrate that the scores are affected by instruction. Nevertheless, a
lack of relationship between the GEs’ scores and amount of previous
instruction would have led one to question the validity of the tests.
Beanblossom (1969) correlated sc ae
lege credits taken in corresponding subjects. concluded on
“definitely”


tive factors, however, such as students taking more courses in their
strong subjects, could account for these results.

The expectation that the tests’ scores increase with the amount of
formal college education completed has been confirmed by an analy-
sis of the scores of 44,000 servicemen tested through the United States
Armed Forces Institute. There appears to be a steady and significant
progression of scores on all tests from those who have completed
high school to those who have completed four years of college. Serv-
icemen completing four years of college score about one standard
deviation higher on each of the examinations than those who have
not attended college.

The relationship of amount of high school preparation to the tests’
scores was determined with the national freshman norming sample
consisting of about 2500 second-term college students. Although the
examinations were not intended to measure high school achievement,
scores on all tests correlated positively with the number of years of
appropriate course work completed in high school.

Additional results relating to the construct validity of the exam- 
inations have emerged from the data collected with the national 
norming sample of approximately 2600 college sophomores. The 
scores of sophomores intending to major in different fields fell into 
expected patterns. The highest mean score on each of the five exam- 
inations was obtained by students intending to major in the field cor- 
responding to the examination. For example, those intending to ma- 
jor in social sciences performed best on the Social Sciences-History 
Test while those majoring in humanities or fine arts scored highest 
on the Humanities Test. 

The intercorrelations of the GEs indicate that to some extent all 
of the examinations except Mathematics are measuring the same abil- 
ity or abilities—perhaps reading comprehension. The median inter- 
correlations found in five studies ranged from a low of .12 between 
Humanities and Mathematics to a high of .56 between English 
Composition and Humanities. It should be pointed out, however, 
that the intercorrelations are much lower than expected of reliable 
tests (above .9) measuring the same factors; thus, it is apparent 
that each test is also measuring some unique knowledge or skill. 

Although the factorial composition of the GEs has not been deter- 
mined, one could guess on the basis of the intercorrelations that two 
factors would account for most of the variance on the tests. The 
Mathematics Test would load high on a mathematical factor while 
the four other examinations would load high on a verbal factor. 

Appropriateness of the tests for adults. One of the major target 
populations of the College-Level Examination Program consists of 
mature adults who have not had any formal education in college. 
The content of the GEs, however, is based on the program of study 
offered to freshmen and sophomores attending liberal arts colleges 
who are mostly in their late teens. Does the content or the format of 
the examinations place the older candidates at a disadvantage? 

An analysis of the scores of approximately 44,000 servicemen on the
GEs appears to suggest that the tests are no more difficult for the older 
than for the younger examinees. The oldest age group in this analysis, 
consisting of those of age 40 and over, was not the lowest scoring 
group on any of the examinations. In fact, this group had the highest 
mean score of any age group on the Social Sciences-History and Hu- 


ulated value of life experience. The highest scores on the three
examinations occurred in the 22 to 24 age range. A limiting


ation of the results is that the older servicemen in the sample
were higher in ability or motivation as a result of self-selection.
French (1969) investigated the GEs’ appropriateness with a sam-
ple of adult and black students. By using an inverse factor analysis
on a matrix of the GEs’ item responses he was able to identify 20
distinct hypothetical types of student, each defined by a certain set
of items. Although the results suggest that the GEs do not give spe-
cial advantage to any type of students, such as blacks or adults, it is
difficult to have confidence in these results because the group of sub- 


portant factor for adults than for younger persons, and it might con-
sequently invalidate the tests as measures of achievement for adults.

Conclusion. In general, the research summarized provides support
for the validity of the GEs as measures of academic achievement.
Many of the studies reviewed, however, do not lead to definitive con-
clusions. Results showing score gains after course exposure and posi-
tive relationships between the tests and amount of previous instruc-
tion have alternative interpretations. Correlations between the GEs
and college grades obtained concurrently are moderately positive, but
the validities of the tests for predicting success in upper-level studies
are significantly lower than their validities for assessing current
achievement level. The research methodology for validating the GEs
can be improved by employing criteria other than grades, by employing
control groups in score-gain studies, and by partialing out contami-
nating factors in correlational studies. Nevertheless, the relationships
found between the GEs and certain relevant variables provide ten-
tative support for the validity of the tests as measures of college
level achievement.

CONCURRENT VALIDITY OF A LITERATURE
TEST IN RELATION TO SELECTION OF PERSONS
FOR GRADUATE STUDY IN ENGLISH


JOSEPH P. SCHNITZEN AND JOHN A. COX
University of Houston


Problem. This study was performed to answer two questions: Is
the field test of the Undergraduate Program (UP) in Literature
(Educational Testing Service, 1969) a valid predictor of academic
achievement in English as estimated by grade point average? If so,
what level of test performance would be useful in selecting among
applicants for graduate study in English?

Measurements. The instrument selected as a measure of academic
achievement in literature was the field test, Literature, UP from
Educational Testing Service (1969). It is a two-hour examination
covering the following areas: poetry, fiction, drama, non-fiction,
world literature, pre-Shakespeare, Shakespeare, English and Ameri-
can literature post-Shakespeare, and poetic metrics. It was judged to
have adequate content validity. The test is recommended for eval-
uating academic achievement in literature at the college undergrad-
uate level by the publisher.

Cumulative grade point average (GPA) was one cri-
terion of interest. This “score” was computed from grades in all
courses attempted by a student except required physical education
where a grade of A = 4 and F = 0. An English grade point average
was also used as a criterion. It was computed using only grades from
courses in English. For graduate students, only graduate work in
English courses was used to compute a GPA.
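
The grade point computation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the text gives A = 4 and F = 0, so the intermediate values for B, C, and D are assumed; courses are weighted equally because credit-hour weighting is not mentioned; and the record layout, the function name `grade_point_average`, and the sample transcript are hypothetical.

```python
# Grade-point values on the 4-point scale described in the text (A = 4, F = 0).
# The values for B, C, and D are assumed; the text states only the endpoints.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(courses, english_only=False):
    """Unweighted mean grade points over the eligible courses.

    `courses` is a list of (subject, letter_grade, is_required_pe) tuples.
    This record layout is hypothetical. Required physical education is
    always excluded; with english_only=True only English courses count.
    Returns None when no course is eligible.
    """
    eligible = [
        GRADE_POINTS[grade]
        for subject, grade, is_required_pe in courses
        if not is_required_pe and (not english_only or subject == "English")
    ]
    if not eligible:
        return None
    return sum(eligible) / len(eligible)


# Hypothetical transcript for one senior.
transcript = [
    ("English", "A", False),
    ("English", "B", False),
    ("History", "C", False),
    ("Physical Education", "A", True),
]
cum_gpa = grade_point_average(transcript)                     # Cum GPA
eng_gpa = grade_point_average(transcript, english_only=True)  # Eng GPA
```

For graduate students the same function would be applied to graduate English courses only, which corresponds to passing a transcript restricted to graduate work.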
Procedure. Three groups of students were of interest: seniors from
the program preparing persons for teacher certification in English
(Eng. TE), seniors who were English majors (Eng.), and first-year
graduate students in English (Grad.). The English department iden- 
tified persons in the three groups. In Eng and Eng TE, 75 persons 
from each group were randomly selected, and letters were sent to 
each person strongly suggesting that the student come for testing at 
one of two specified times. Among graduates a similar technique was 
used, 50 letters being sent. No procedure was available to require that 
the student be tested. 

Test administration was performed by personnel from the Coun- 
seling and Testing Service during May, 1970. In all 32 Eng TE, 
35 Eng, and 33 Grad students were tested. Each of these persons 
was essentially a volunteer. 

When testing was complete a search of student records was made 
in an attempt to find the grades made by each student for whom 
test scores were available. Grades were located for 31 Eng TE stu- 
dents, 33 Eng students, and 33 Grad students. In the Grad group 
two students who had completed fewer than three courses were 
eliminated from the correlational sample. Using the grade sheets, 
cumulative (Cum) GPA and Eng GPA were computed for Eng TE 
and Eng students. Only Eng GPA, which included solely graduate
English course grades, was computed for students in Grad.

From these data, distributions, means, and standard deviations for
the Literature, UP test were computed for each group. These fig-
ures have been placed in Table 1. Correlation coefficients were com-
puted for each GPA distribution and for the Literature test score
distribution. Scatter diagrams were prepared and coefficients were
computed from the diagrams. The results have been placed in Table
2.
Results. The figures in Table 1 show that there was about one-
half a standard deviation difference between the mean literature
scores of the Eng TE seniors and that of the Eng seniors, with the
Eng TE mean being lower. This difference was statistically signifi-
cant (t = 1231). The observation is in line with expectation, since
Eng TE students usually take about two fewer courses in English
than do Eng students. Furthermore, the Grad students scored more
than one standard deviation higher on the test than did the Eng
seniors, a difference that was also statistically significant (t = 23.47).
Since members of the Grad group tended to be at the end of their
first year of graduate study and to have completed an additional
year’s course work above that of Eng group members, this finding
was also expected. The data show that, while there was minor over-
lap in the three distributions, the Literature, UP test discriminated
among the three groups of students in the expected direction and
that the differences among means were rather large.

TABLE 1

Scores Eng. TE Eng. Grad.

750-775 1
725-749
700-724
675-699
650-674
625-649
600-624
575-599
550-574
525-549
500-524
475-499
450-474
425-449
400-424
375-399
350-374
325-349
Mean
SD

The correlation between Literature, UP test scores and Cum GPA
(see Table 2) was .15 for Eng TE, which is not statistically significant,
and the corresponding correlation for the Eng group was .40, which
is statistically significant at the .05 level. Thus, the test scores ex-
hibited concurrent validity for overall academic performance as es-
timated by course grades only among English majors. When perfor-
mance in English courses only was the criterion (Eng GPA), the
relationships for test scores versus performance were somewhat
higher. However, among Eng TE seniors the concurrent validity
was not statistically reliable (r = .25). Neither was the concurrent
validity among the Grad group (r = .32). Among the Eng group
the relationship was significant beyond the .05 level (r = .48).

TABLE 2
Correlation Coefficients between Grade Point Averages and Literature Test Scores

                      Literature Test Scores
Group         Eng. TE Seniors   Eng. Seniors   Eng. Grad.
Cum. GPA           .15              .40*
Eng. GPA           .25              .48*           .32
N                  31               33

* Significant beyond the .05 level.
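
The significance statements above can be checked with the usual t test for a correlation coefficient, t = r √((N − 2)/(1 − r²)) with N − 2 degrees of freedom. The sketch below is not the authors' procedure (the text says only that coefficients were obtained from scatter diagrams). It assumes a two-tailed test at the .05 level, critical values taken from standard t tables, and N = 31 for the Grad group (33 students less the 2 eliminated from the correlational sample).

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Two-tailed .05 critical values of t from standard tables, keyed by df.
CRITICAL_T = {29: 2.045, 31: 2.040}

# (label, r, N) as reported in the text.
# N = 31 for the Grad group is an assumption (33 tested minus 2 eliminated).
reported = [
    ("Eng TE, Cum GPA", 0.15, 31),
    ("Eng, Cum GPA", 0.40, 33),
    ("Eng TE, Eng GPA", 0.25, 31),
    ("Eng, Eng GPA", 0.48, 33),
    ("Grad, Eng GPA", 0.32, 31),
]

# Map each label to (t rounded to 2 places, significant at .05?).
results = {}
for label, r, n in reported:
    t = t_for_correlation(r, n)
    results[label] = (round(t, 2), abs(t) > CRITICAL_T[n - 2])

for label, (t, significant) in results.items():
    print(f"{label}: t = {t}, significant at .05: {significant}")
```

Under these assumptions only the two coefficients for the Eng seniors exceed the critical value, which agrees with the pattern of significance reported in the text.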

Conclusions. The Literature Test, UP has demonstrated evidence
of construct validity in that scores from the test differentiated among
groups of persons who had completed different amounts of course
work in the field of English. The amount of differentiation was a
practical amount. Among college senior English majors the Litera-
ture, UP test scores were moderately predictive of academic achieve-
ment. While the relationship between Literature, UP test scores and
graduate academic achievement in English was not high enough to
be statistically reliable in this study, the results were perhaps encour-
aging. In this study, the graduate students had already completed
a year's work in graduate English courses. Thus they represented
a selected sample. Each had applied for graduate school admission,
had been evaluated on undergraduate performance and Graduate
Record Examination performance, and had been accepted into grad-
uate school before completing his course work. Had the group of
senior English majors in this study been entered into graduate study
of English without selection, the relationship between their test
performance and graduate academic performance would have likely
been higher than that found here for the Grad sample.

This study had shown that the Literature, UP test was valid in a
construct sense and had concurrent validity among senior English
majors . . . there was an answer to the first question to which
. . . was addressed. While no direct answer to the second
. . . the basis for making a judgement as to a specific
. . . Literature, UP test was furnished in the score dis-
tributions among the Eng and Grad groups.

Summary. The Literature field test of the Undergraduate Program was
administered to 32 English teacher education majors, 35 English
majors, and 33 English graduate students. Test scores were found to
be related to GPA among senior English majors. Mean scores for

Testing Service. . . . for Deans

PREDICTING QUALITY POINT AVERAGES IN
MASTER'S DEGREE PROGRAMS IN EDUCATION


JERRY B. AYERS
Tennessee Technological University


THE prediction of success, as measured by quality point average,
in Master's degree programs in education is a major concern of
graduate schools. With increases in enrollment, there is a need to ex-
amine the quality point average prediction schemes that are in use by
regional state universities. In studies conducted by Nunnery and
Aldmon (1964), Owens and Roaden (1966), and Herbert (1967),
the undergraduate quality point average (UQPA) was found to be
the best predictor of graduate quality point average (GQPA). Mil-
ler (1970) summarized a number of predictive studies in which the
Miller Analogies Test (MAT) has been used to predict GQPA. Her-
bert (1967) indicated a need for studies of the prediction of GQPA
using a combination of the UQPA, MAT, and the National Teacher
Examinations (NTE). Miller (1970) reported a correlation of .71
between the MAT scores and weighted total scores of the Common
Examination of the NTE.

The inclusion of a measure of the ability of the graduate student
to use the English language effectively in formal course work and
in the preparation of papers, reports, and theses has been neglected
in predictive studies of GQPA. It would appear that a test requiring
use of the English language should be a significant predictor of
success in a Master's program in education. An instrument such as
the New Purdue Placement Test in English (PET) can sample a
student’s knowledge of “good English” (Wykoff, McKee and Rem-
mers, 1955).

Purpose. The purpose of this study was to determine the rela-
tionship between each of several commonly used graduate school
admission criteria viewed as predictors and success in graduate
school as measured by GQPA for graduates who had completed
the Master of Arts in education at a regional state university.

Variables. The admissions criteria for full standing in a graduate
program in education required a student to have completed his
undergraduate program with a minimum UQPA of 2.50 (on a 4.00
scale) and to have completed the PET and MAT. In addition, a
limited number of students presented scores on the NTE. For pur-
poses of this study scores on the following parts of the NTE were
used: Teaching Area Examination (TAE); Professional Education
Scores (PES); Written English Expression (WEE); Social Studies,
Literature, and Fine Arts (SLF); Science and Mathematics (SAM);
General Education Subtotal (GES); Weighted Common Examina-
tion Total Score (WCE); and Composite NTE Score (CNS). The
major criterion for success in graduate school was the completion of
all requirements with a minimum GQPA of 3.00.

Sample. The sample was composed of those graduates (N = 241)
who had completed the Master of Arts in education between June
1963 and August 1970 at a regional state university. All graduates
had completed their programs of study with major emphasis in ed-
ucational administration and supervision, curriculum and instruc-
tion, or guidance and counseling and had completed the MAT and
PET. The NTE had been completed by 39 subjects.

Results and discussion. Intercorrelations, means and standard de-
viations for GQPA, UQPA, MAT and PET for each of the three
groups and the total group of graduates are presented in Table 1.

UQPA and PET did appear to have a greater effectiveness in
predicting GQPA than did the MAT for the graduates of the ad-
ministration and supervision and the curriculum and instruction pro-
grams. . . . to UQPA were in agreement with
Nunnery and Aldmon (1964), Owens and Roaden
(1966), and Herbert (1967). The correlation between PET and
GQPA for these two groups was highly significant. This finding
might be . . . had completed a substantial . . .
grades had been based largely on
criteria required in earning grades at the undergraduate level and in
performance on the PET.

TABLE 1
Intercorrelations, Means, and Standard Deviations of GQPA, UQPA, MAT,
and PET for Graduates in Three Different Curricula and for the Total Group

UQPA MAT PET Mean SD

Administration and
Supervision (N = 86)
GQPA 41 A 3.4 0.3
UQPA 39 E 2.6 0.4
MAT E: m3 M4
PET 1324 26.5

Curriculum and
Instruction (N = 47)
GQPA .69 .43 E! 3.7 0.3
UQPA .36 .52 3.0 0.5
MAT .60 38.0 14.5
PET 150.9 28.0

Guidance and
Counseling (N = 108)
MA 147

All Curricula (N = 241) 40 50 $5 03
MA "34 E 2.7 0.4
d 165 364 14.9
МА | uw; 27.6

* All correlations are significant at or beyond the .01 level.

The lack of substantial correlation between the criterion GQPA
and each of the three predictors UQPA, MAT, and PET for gradu-
ates of the guidance and counseling program could be interpreted in
part, since the curriculum involved performance criteria in laboratory
course situations such as counseling techniques, test administration
and interpretation, counseling interviews, and guidance ex-
periences. Grades in courses of this nature reflected less emphasis on
. . .
correlations of MAT with GQPA for all groups and the total group were
comparable to those previously reported (Miller, 1970). Multiple cor-
relations, using the stepwise regression technique, for each of the
groups were as follows: administration and supervision, .349; cur-
riculum and instruction, .563; guidance and counseling, .158; and
all curricula, .332. The variables in this study served as the best
predictors of success for the curriculum and instruction program in
that 31.7 per cent of the variance was explained.
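
The percentage of variance explained follows from squaring the multiple correlation. A minimal sketch using the four multiple R values quoted above; the dictionary name and the rounding to one decimal place are illustrative choices, not part of the study.

```python
# Multiple correlations as quoted in the text, keyed by program.
multiple_r = {
    "administration and supervision": 0.349,
    "curriculum and instruction": 0.563,
    "guidance and counseling": 0.158,
    "all curricula": 0.332,
}

# Per cent of criterion variance explained is 100 * R squared.
variance_explained = {
    program: round(100 * r ** 2, 1) for program, r in multiple_r.items()
}

for program, pct in variance_explained.items():
    print(f"{program}: {pct} per cent")
```

For the curriculum and instruction program this gives 31.7 per cent, the figure cited in the text.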

Table 2 presents the intercorrelations, means, and standard devia-
tions for all variables and various subtests of the NTE for a limited
group of graduates. An examination of the intercorrelations of the
various measures indicated that the UQPA was the best overall pre-
dictor of GQPA. The correlation of the MAT with WCE was .66,
which is in agreement with the value reported by Miller (1970). Using
the stepwise regression technique, the equation for the prediction of
GQPA was as follows:

GQPA = 1.770 + .285UQPA + .002PET
+ .001TAE + .004PES − .005WEE − .000SLF

The multiple R for this equation was .554 which accounted for
about 30.7 per cent of the variance in the criterion variable. The MAT
did not enter into the equation.
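
As an illustration of how the prediction equation would be applied to one applicant, the sketch below evaluates it for a single set of scores. Several things here are assumptions: the TAE coefficient is garbled in the source and is read as .001; the SLF coefficient is printed as .000 and so contributes nothing at the reported precision; and the applicant's scores, including the NTE score scale, are hypothetical values chosen only to exercise the arithmetic.

```python
# Regression weights as read from the reported equation.
# The TAE weight (.001) is an assumed reading of a garbled coefficient.
INTERCEPT = 1.770
COEFFICIENTS = {
    "UQPA": 0.285,
    "PET": 0.002,
    "TAE": 0.001,
    "PES": 0.004,
    "WEE": -0.005,
    "SLF": 0.0,  # printed as -.000 in the source
}


def predict_gqpa(scores):
    """Predicted graduate quality point average for one applicant.

    `scores` maps each predictor name in COEFFICIENTS to the applicant's
    score on that measure.
    """
    return INTERCEPT + sum(
        weight * scores[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical applicant; none of these values come from the study.
applicant = {"UQPA": 3.0, "PET": 150, "TAE": 600, "PES": 60, "WEE": 60, "SLF": 60}
predicted = predict_gqpa(applicant)
print(f"Predicted GQPA: {predicted:.3f}")
```

Because the weights are rounded to three decimals in the source, predictions from this sketch are approximate at best and should not be read as reproducing the original analysis.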

Conclusions. The correlations between MAT and GQPA were 
typical of those found in the literature. They were interpreted as 
justifying the continued use of the MAT when only a single predictor 
is used. The validity coefficients presented in this study appeared 
also to justify the use of the UQPA and/or PET as predictors of 
GQPA in all curricula. The introduction of selected scores from 
the NTE when combined with UQPA and PET would seem to
enhance the prediction of GQPA.

ANOTHER CONTRIBUTION TO ESTIMATING
SUCCESS IN GRADUATE SCHOOL: A SEARCH FOR
SEX DIFFERENCES AND COMPARISON BETWEEN
THREE DEGREE TYPES


DAVID A. PAYNE, ROBERT A. WELLS, AND ROBERT R. CLARKE
University of Georgia


THE literature of academic prediction in graduate education is
filled with studies demonstrating correlations of various selection
devices and grade point averages. The favorite devices are of course
the Graduate Record Examination and the Miller Analogies Test.
Results using these tests have certainly been unpredictable with re-
ported validities ranging from .08 for the GRE-Verbal (Newman,
. . . ranging from .00 (Travers, 1948) to .69 (Gustad, 1950) have also
been noted. Studies have yielded more supportive evidence for the
MAT, with validities in the mid .20's and .30's (e.g., Payne and
Tuttle, 1966). Those concerned with selection in colleges of educa-
tion will, however, find few readily available data on the potentially
applicable and useful National Teacher Examinations.

To describe comparative validities between these three instruments
was one of the purposes of the present investigation. In addition, since
most predictive studies tend to focus on masters or doctoral students,
or confound conclusions by combining these groups, prediction of
success was concerned with a comparison of these two types of
graduate degree programs. A few years ago the ultimate in degree
acquisition was the masters. Now if one does not pursue a doctorate,
one must at least secure a sixth year certificate. This alternative be-
came the third degree type. And finally, a perusal of the literature
reveals very little concern with sex differences in predictability of
success in graduate school. At other levels it is known that females
are more predictable than males (Seashore, 1962).
Variables. Predictors involved in the present study were as follows: 


1. Time to complete (TIME). Number of months from beginning to 
end of program or graduation. Doctoral students were measured 
from end of masters program. 

2. National Teacher Examination—Common (NTE-C). 

3. National Teacher Examination—Optional (NTE-O). One of 13 
possible 80 minute examinations in various teaching areas. 

4. Graduate Record Examination—Verbal (GRE-V). 

5. Graduate Record Examination—Quantitative (GRE-Q). 

6. Graduate Record Examination— Total (GRE-T). Simple addi- 
tion of V and Q. 

7. Miller Analogies Test (MAT). Raw scores used. 

8. Undergraduate Grade Point Average (UA). Maximum average =
100.

9. Graduate Grade Point Average (GA). 


Samples. All graduates of the College of Education during the
academic year September, 1968 to August, 1969 composed the total
group from which subgroups were determined. Theoretically data
were available on a total group of 685 students, consisting of 314
males and 371 females. During this academic year 58 doctorates
(both PhD and EdD), 503 master of education degrees, and 124
sixth year certificates were awarded. Applicants to the College of
Education may submit scores on either the GRE, MAT, or NTE.
Therefore, not all students had scores on all tests. This fact
may somewhat confound the interpretation of results. The sub-
sample sizes are large enough in most cases to allow for a reasonable
degree of confidence in the results. It was decided arbitrarily that
any groups containing fewer than 10 subjects would not undergo
correlational analysis.


Results and conclusions. Means, standard deviations, and correla-
tions for the samples of Masters degree, sixth year certificate, and
doctoral degree recipients are summarized in Tables 1, 2, and 3 re-
spectively. Examination of Table 1 reveals that for the Masters
people UA was about the best predictor of success. In addition it ap-
peared that females were somewhat more predictable than males. This
conclusion regarding greater female predictability was confirmed
when multiple correlations were computed from a combination of
NTE-C, GRE-T, and UA. Multiple R’s of .34 (N = 61), .45 (N =
66), and .32 (N = 127) were found for males, females, and the total
group, respectively. The multiple R’s still did not reflect a significant
increase over the UA-GA zero order correlation.

TABLE 1
Descriptive Statistics Relating to Prediction of Success for Masters Degree Recipients

TABLE 3
Descriptive Statistics Relating to Prediction of Success for Doctorate Recipients

Predictor: TIME, NTE-C, NTE-O, GRE-V, GRE-Q, GRE-T, MAT, UA, GA

In contrast to the Masters people, success in the Sixth Year Pro- 
gram (Table 2) can best be predicted from the GRE, with the 
Quantitative score being most effective for males, Verbal for fe- 
males, and the Total Score for the combined group. Although no sex 
differences were noted in predictability, it was found that males 
tended to have higher average scores on most variables. 

Success of the doctoral degree recipients (Table 3) was best pre-
dicted by the NTE-C. As with the Sixth Year people, no sex differ-
ences were noted, but perhaps a selection bias (self or institutional)
was evidenced by higher average predictor scores for females. In
some cases the sex differences were quite large. The large negative 
GRE-V correlation with GA was considered a result of sampling 
error, as might many of the correlations based on small samples. 

Failure of the MAT to meet the writers’ expectations might be 
somewhat a function of the lack of data. A sufficient number of 
cases to make a real test of predictive efficiency were not available. 

In general it took females longer to complete the program than 
males. This time variable, however, was not an effective predictor. 

It was almost impossible to determine a “most predictable” pro-
gram. On either a comparative or absolute scale, differential pre-
dictability was almost a “toss up.” When each program was asso-
ciated with its best predictor, the results were about the same. A
trend was noted such that UA, GRE, and NTE each in turn became
best predictors of success in Masters, Sixth Year, and Doctoral pro-
grams. This might be interpreted as an increasing institutional and
content emphasis on what constitutes success.

PREDICTION OF CHOICE OF AND SUCCESS
IN AGRICULTURE AS A COLLEGE MAJOR¹


JAMES M. RICHARDS, JR.
American Institutes for Research

IT is a cliche that mankind is in a desperate race between growth
in population and growth in food supply. Moreover, there is con-
siderable reason to believe that future increases in agricultural
production, both in the United States and in other countries, will
depend increasingly on the skill and level of training of agricul-
tural workers rather than on purely technical improvements such as
the introduction of hybrid seed varieties (Brown, 1967). For ex-
ample, an important characteristic, compared to earlier seed vari-
eties, of the seed varieties comprising the “Green Revolution” in
underdeveloped countries is greater responsiveness to the applica-
tion of fertilizer.

This implies that agricultural education at the college level will
continue to be very important in the United States despite increas-
ing urbanization and a decreasing proportion of the labor force
engaged in farming. Welch’s recent study (1970) suggests that a
substantial part of the contribution of education to the farm earn-
ings of college graduates should be attributed to increased ability
to keep pace with improvements in available inputs and in the allo-
cation of these inputs among various uses. These are just the char-
acteristics likely to be required for future gains in agricultural
production, and for maintaining a high level of productivity with-
out destroying the ecosystem.

¹ . . . from the U.S. Office of Education under grant number
EG-0-9-610065-1367. Opinions expressed are the author's.

The purpose of this study was to investigate the
validity of Project TALENT (Flanagan, Dailey, Shaycoft, Gor-
ham, Orr, and Goldberg, 1962; Flanagan, Davis, Dailey, Shaycoft,
Orr, Goldberg, and Neyman, 1964; Flanagan, Cooley, Lohnes,
Schoenfeldt, Holdeman, Combs, and Becker, 1966) tests for pre-
dicting choice of and success in agriculture as a college major.² The
overall goal of Project TALENT is to understand the nature and 
development of the talents of American young people. Because the 
questions involved are essentially developmental, the methodol- 
ogy has been, and continues to be, longitudinal. Specifically, in 
1960 a probability sample was drawn of approximately 5 percent 
of the high schools in the United States. The 400,000 students in 
grades 9 through 12 attending the sampled high schools were ad- 
ministered two days of educational and psychological inventories 
specially constructed for Project TALENT. The student inventories 
included measures of general ability, specialized aptitudes, interests, 
personality, student activities, home background, and plans for the 
future. The overall design of Project TALENT calls for follow-up 
studies at intervals of one, five, ten, and twenty years after each
class was graduated from high school. Thus Project TALENT pro- 
vides the first long-range longitudinal study of a representative 
sample of students assessed with a comprehensive set of psycho- 
logical, educational, and personal measures. 
The present paper will summarize some of the results of the orig-
inal assessment and the five-year follow-up. Although data col-
lection for all four five-year follow-ups has been completed, be-
cause of the great mass of the Project TALENT data and the
delays inherent in coding and keypunching responses to the follow-
up questionnaires, not all of the merged files are complete at the
present time (fall of 1970). Therefore, this paper is restricted to
students who were enrolled in the eleventh and twelfth grade in
1960. The study is further restricted to men, because of the very
small number of women who choose agriculture as a major.
Three criterion measures were used: (a) choice of agriculture as
intended major on the original assessment, (b) designation of agri-
culture as actual major on the five-year follow-up, and (c) Grade
Point Average in major field as self-reported on the five-year fol-
low-up by those who designated agriculture as their actual major.

² Because of the importance of (and the author's specific interest in) agri-
culture, . . . attempts to pull together data about agriculture
in a . . . form. Some of the same data were also incorporated
in a more general treatment of college majors (Richards, 1970).
The predictors consisted of 31 scores, including three a priori
composites of the TALENT ability tests (Verbal Composite,


of socioeconomic status. These tests have been described in
detail in earlier Project TALENT reports (Flanagan et al., 1962,


with choice are point biserials comparing students choosing agri-
culture with all other college students while correlations with GPA


e, cross-validation data are not available, it should be men-
tioned that shrinkage in the multiple correlation values could be


In interpreting these results, it must be remembered that there
are ceiling effects on the point biserials as a consequence of the
small proportion (2.5% to 2.9%) of students choosing agriculture


е as major—both on the original 
follow-up—is that they had 
terest in Farming scale. For both 
dents, the scale having the highest corre А 
Maturity from the TALENT personality test. It is somewhat sur-
prising that the aptitude composites were not more highly cor-
i the predictors of choice and 


success are not very similar. This trend complicates the task of a
counselor dealing with students considering agriculture as a college
major or a career. It is possible, of course, that one or the other
set of predictors will be more similar to the predictors of success
and satisfaction on the job in agriculture for these same students. 
The ten-year and twenty-year Project TALENT follow-ups should 
throw additional light on this question. 
USE OF THE ROTC QUALIFYING EXAMINATION
FOR SELECTION OF STUDENTS TO ENROLL IN 
ADVANCED COURSES IN ROTC AS JUNIORS 


THOMAS M. GOOLSBY, JR. AND DONALD A. WILLIAMSON
University of Georgia 


THE Reserve Officer Training Corps Qualifying Examination
(RQ) has been used since 1954 to screen college students into the 
advanced military course offerings beginning in the junior year. 
Antecedents to the 1954 examination date back to the years im- 
mediately following World War II. 

Equivalent forms, RQ-8 and RQ-9, were implemented in the 
spring of 1966. RQ contains measures of verbal and mathematical 
ability. 

The present study was designed to determine the extent to which 
RQ is useful for screening students into Advanced ROTC courses. 

Procedures. The RQ (Form 9) was administered to students ap- 
plying for Advanced ROTC in the fall of 1969 at a large south- 
eastern university when they neared the end of their sophomore 
year of college. Scholastic Aptitude Test (SAT) scores, Freshman 
Military Grade Averages (F-Mil), Freshman Grade Point Aver- 
ages (F-GPA), Sophomore Military Grade Averages (S-Mil), 
Sophomore Grade Point Averages (S-GPA), Junior Military Grade 
Averages (J-Mil), Junior Grade Point Averages (J-GPA), Cumu- 
lative Grade Point Averages (C-GPA), and certain subject area 
grade averages were obtained from student records. 

Intercorrelations among all measures were obtained. 

Results. The means and standard deviations for all variables
are presented in Table 1. The estimated KR reliability for RQ-8
sample (N = 300).
Table 2 presents the intercorrelations among subtests scores and 
TABLE 1
Means and Standard Deviations for Certain Variables
(N = 77)
Variable                                                Mean      S.D.
ROTC Qualifying Examination-Verbal (RQ-V)               47.27
ROTC Qualifying Examination-Quantitative (RQ-Q)         32.16
ROTC Qualifying Examination-Total Score (RQ-T)          79.19
Scholastic Aptitude Test-Verbal (SAT-V)                509.87    168.22
Scholastic Aptitude Test-Quantitative (SAT-Q)          512.44    180.50
Scholastic Aptitude Test-Total (SAT-T)                1054.49    122.40
Freshman Military Grade Average (F-Mil)                  3.46
Sophomore Military Grade Average (S-Mil)                 3.00
Junior Military Grade Average (J-Mil)                    3.00
Freshman Grade Point Average (F-GPA)                     2.72
Sophomore Grade Point Average (S-GPA)                    2.07       .56
Junior Grade Point Average (J-GPA)                       2.61
Cumulative Grade Point Average (C-GPA)                   2.69
English Grade Point Average (E-GPA)                      2.24
Math Grade Point Average (M-GPA)                         2.50
Point Average (SS-GPA)


TABLE 2
Intercorrelations among Subtests Scores and Total Tests Scores for RQ and SAT
(N = 77)
RQ-Q SAT-V SAT-Q SAT-T
К-У 15 69 38 61 7 


E 15 65 49 
58 64 71 


TABLE 3
Correlations of Subtests Scores and Total Tests Scores for RQ and SAT with Certain
GPA's (N = 77)
F-Mil S-Mil J-Mil F-GPA S-GPA J-GPA C-GPA
RQV 16 рр 13 яз 
17 12°46 
ROQ 10  30* 19 21 A us 22* 
ВОТ — 19 7 s» — i5 16 13 04 12 
лш о s P 
29 
ВАТТ 15 29¢ 19 En a ре Ms 


16 10 11 п 
7115 зебову о ere E N a ll 
* Significantly different from zero at the .05 level.
TABLE 4
Predictions of J-Mil and J-GPA by Certain Freshman and Sophomore Grade
Averages (N = 77)*
          J-Mil    J-GPA
F-Mil      50
S-Mil      49
F-GPA      20       45
S-GPA      50       65
* Decimals omitted.
TABLE 5
Correlations of Subtests Scores and Total Tests Scores for RQ and SAT with Certain
Subject Area GPA's
(N = 77)
         E-GPA   M-GPA   Sc-GPA   SS-GPA
RQ-V      24*     08      05       23*
RQ-Q      19      28*     26*      16
RQ-T      23*     08      17       20
SAT-V     20      02      07       18
SAT-Q     23*     23*     15       20
SAT-T     24*     12      1        22*
* Significantly different from zero at the .05 level.
Decimals omitted.


TABLE 6
Certain Multiple Predictions of J-Mil and J-GPA
(N = 77)

RQ-T & S-GPA vs J-Mil = 50
RQ-T & F-Mil vs J-Mil = 50
SAT-T & F-Mil vs J-Mil = 51
SAT-T & S-GPA vs J-Mil = 51
F-Mil & S-Mil vs J-Mil = 57
RQ-T & SAT-T vs J-Mil = 14
RQ-T & SAT-T vs J-GPA = 12
& S-GPA vs J-GPA = 18

total test scores for RQ and SAT. The correlation of RQ-V and 
RQ-Q (.15) is substantially different from the correlation between 
SAT-V and SAT-Q (.79). The correlation between RQ-T and 
SAT-T (.71) is somewhat lower than had been obtained earlier
(.81). Even though the relationships of .71 to .81 are moderately
high, they are mostly irrelevant to and are no reasonable justi-
fication for the adoption and use of RQ for the purposes outlined
by the users.

The correlations of subtest scores and total tests scores for 
RQ and SAT with certain GPA's in Table 3 show very little and 
mostly no relationship at the .05 level of significance. The RQ 
shows no relationship to J-Mil or J-GPA at the .05 level.
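
As a rough guide to what "significant at the .05 level" implies with N = 77, the smallest correlation that differs from zero can be approximated with the Fisher z transformation. The sketch below is illustrative only: the function name and the use of the normal approximation (rather than an exact t test) are assumptions and are not taken from the original analysis.

```python
import math

def critical_r(n, z_alpha=1.96):
    """Approximate smallest |r| that is significant at the two-tailed .05 level.

    Assumes the Fisher z approximation: atanh(r) has standard error
    1 / sqrt(n - 3), so the cutoff on r is tanh(z_alpha / sqrt(n - 3)).
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    return math.tanh(z_alpha / math.sqrt(n - 3))

# Sample size reported for the ROTC study.
r_crit = critical_r(77)
print(round(r_crit, 3))
```

Under this approximation the cutoff for N = 77 falls between .22 and .23, which is in line with the smallest starred entries in the tables.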

Table 4 shows F-Mil, S-Mil, and S-GPA to be the best single 
predictors of J-Mil. 

Again, the correlations of subtests scores and of total tests scores
for RQ and SAT with certain Subject Area GPA's in Table 5
show very little and almost no relationship at the .05 level.

The multiple predictors of J-Mil and J-GPA in Table 6 are not
materially different from the comparable single variable predictors
presented earlier.

The data presented in this paper suggest that the degree of the 
relationships between academic ability measures investigated and 
college performance (grade point average) is low and questionable 
in its practical significance. 

The data presented in this paper cast substantial doubt on the 
advisability of the use of RQ for the selection of even a reasonably 
large proportion of subjects from the population applying for ad- 
mission to Advanced ROTC. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
M, 317-314 


A NOTE ON THE PREDICTIVE VALIDITY OF THE
COOPERATIVE ALGEBRA III


CLIFFORD B. TATHAM AND ELAINE J. TATHAM


Ottawa University 
Ottawa, Kansas


Many colleges use a variety of instruments as placement exam-
inations. The object of these examinations is to aid in placing a
student in the appropriate level of a particular course. Such place-
ment testing is frequently done in the areas of English and Mathe-
matics. While most of the instruments used are valid and reliable,
many advisors of college freshmen are frequently confronted with
the problem of encouraging, or discouraging, a student to enroll in
a particular course. Thus, the purpose of this paper was to deter-
mine the predictive validity of the Cooperative Mathematics Test:
Algebra III (Coop Algebra III) test at a small liberal arts college.
The criterion was course grade. In addition, the reliability of the
for estimating reliability was used so that the ease of Saupe's
method might be employed in future studies (Saupe, 1961). 

A discriminant analysis (Wert, Neidt, and Ahmann, 1954) was 
utilized to determine whether students who completed the M101
course successfully (grade of C or better) could be differentiated 
from those students who completed the M101 course unsuccessfully 
(grade of D or F). 

Results. Using Hoyt's method of estimating reliability, a reli-
ability coefficient r = .8695 and a standard error of measurement
of 2.52 were obtained. These compare favorably with the published
reliability of .84 and a standard error of measurement of 2.66
(Educational Testing Service, 1964). 
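
The link between a reliability coefficient and a standard error of measurement can be sketched with the classical test theory relation SEM = SD * sqrt(1 - r). The code below is a minimal illustration under that assumption; the function names are hypothetical, and the standard deviations it produces are only implied values, since the source does not report them.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return sd * math.sqrt(1.0 - reliability)

def implied_sd(sem, reliability):
    """Invert the formula to get the score SD implied by a given SEM."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return sem / math.sqrt(1.0 - reliability)

# Figures quoted in the text: this sample, then the published values.
sd_sample = implied_sd(2.52, 0.8695)
sd_published = implied_sd(2.66, 0.84)
print(round(sd_sample, 2), round(sd_published, 2))
```

If the relation holds, the two sets of figures imply raw-score standard deviations of similar size (a little under 7 points in each case).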

Saupe's method of estimating reliability also yielded r = .8695.

The discriminant analysis indicated there was a significant dif- 
ference between those students who successfully completed the 
M101 course and those students who failed to complete the M101 
course successfully (F = 15.36; p < .01).

The analysis resulted in the equation: v = .001239x. A critical 
value of v = .011979 was obtained; thus the critical Coop Algebra
raw score was approximately 9.

Summary. The results of this study indicate that the Coop Algebra 
III is a reliable instrument when used as a placement test. In ad- 
dition, the test scores can be utilized by advisors to aid a student 
in deciding whether or not the student can successfully complete a
mathematics course the content of which consists of some algebra 
and an introduction to calculus and analytic geometry. 
STATISTICAL ANALYSIS OF THREE CRITICAL
THINKING TESTS 


JOHN FOLLMAN AND WILLIAM MILLER
University of South Florida 


ELDON BURG 
U. S. Army


Introduction. The purpose of this study was to investigate the
psychometric characteristics of three critical thinking tests by de-
termining their item difficulty and discrimination indices, reliability
coefficients, item validities, and basic dimensions. A Test of Critical
Thinking Form G (Form G) (American Council on Education,
1951); the Cornell Critical Thinking Test Form Z (Form Z) (Ennis,
1961); and the Watson-Glaser Critical Thinking Appraisal Form
ZM (Form ZM) (Watson and Glaser, 1964) were the tests used.

The tests were administered to students in a junior level educa-
tional psychology course at Wisconsin State University, Oshkosh in
May, 1967. The numbers of subjects for the different analyses
ranged from 190 to 227. Form G has 52 items, Form Z has 52 items,
and Form ZM has 100 items.

Results. Mean item discrimination indices were .34 for Form G,
.23 for Form Z, and .18 for Form ZM.

Corrected split-half and KR-20 total test reliability estimates
were, respectively, .792 and .819 for Form G, .548 and .632 for
Form Z, and .655 and .667 for Form ZM.
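
For readers unfamiliar with the KR-20 estimate, a minimal sketch of the computation from a matrix of right/wrong item scores is given here. The function name and the toy response matrix are hypothetical and are not data from the study; the population form of the total-score variance is assumed.

```python
def kr20(item_scores):
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    item_scores: list of examinee rows, each a list of 0/1 item scores.
    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores).
    """
    n_people = len(item_scores)
    k = len(item_scores[0])
    if k < 2 or n_people < 2:
        raise ValueError("need at least two items and two examinees")
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (divide by N).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_people  # item difficulty
        pq_sum += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - pq_sum / var_total)

# Hypothetical responses of five examinees to four items.
toy = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(round(kr20(toy), 2))
```

The same item difficulties (p) that enter the formula are the item difficulty indices referred to in the introduction.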
Corrected split-half reliability estimates were moderately high
for two and high for seven Form G subtests, moderately high for
two and low for five Form Z subtests, and moderately high for all
Form ZM subtests.

Mean point bi-serial correlations were .297 for Form G, .200 for
Form Z, and .168 for Form ZM.

Form G had a higher proportion of high, significant phi coef-
ficient inter-item correlations than either Form Z or Form ZM.

Factor analysis produced 19 factors for Form G, 22 for Form Z, 
and 39 for Form ZM. With some exceptions, particularly Form G 
subtests, items did not load on factors consistent with the test mak- 
ers' a priori subtest groupings, for all three tests. 

Recognition-of-assumptions items loaded fairly strongly within 
unrotated factors for all three tests. 

Conclusions. It was concluded that Form G, Form Z, and Form
ZM are reliable tests, that Form G is the most useful test, and that
refinement should be undertaken for Form Z and Form ZM.
CROSS-VALIDATION OF THE ORLEANS-HANNA
ALGEBRA PROGNOSIS TEST AND THE ORLEANS- 
HANNA GEOMETRY PROGNOSIS TEST 


JOANNE M. LENKE AND HAROLD E. BLIGH
Harcourt Brace Jovanovich, Inc. 


BERNARD H. KANE*


Pleasantville High School 
Pleasantville, New York 


THE procedure of cross-validation is a necessary step in estab-
lishing the predictive power of a prognostic instrument in which
weights are applied to the separate parts making up the total score. 
The technique involves the determination of the “best” weights with 
an initial sample and the subsequent verification of these weights 
with a second sample from the same population. 

A total score on the Orleans-Hanna Algebra Prognosis Test and
on the Orleans-Hanna Geometry Prognosis Test is a composite of
(a) four student-reported past course-grades, (b) student-predicted 
course-grade in algebra or geometry, and (c) number right of the 
Prognosis Test work-sample items. For both of the Orleans-Hanna 
tests, an a priori weight of two was assigned to each of the five 
course-grade variables and a weight of one was assigned to the test 
score representing the number of work-sample items answered cor- 
rectly. Validation studies (Orleans and Hanna, 1968 a, b) were 
undertaken during the norming of the tests to determine the multi- 
ple regression weights for predicting each of the four criteria of 
success: mid-year and final course grades, and mid-year and end-
of-year achievement test scores. The multiple correlations were then
compared to the zero-order correlations obtained with the a priori
weights. The results of these studies indicated that the weights, as
initially assigned, could be applied in the final scoring process with-
out significant loss in prediction. These weights were thus incor-
porated into the standard scoring procedure and were used in com-
puting the total score and related normative data reported in the 
test manuals. 
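
The a priori scoring rule described above (a weight of two on each of the five course-grade variables and a weight of one on the work-sample score) can be sketched as follows. The function name, the numeric coding of grades, and the example values are hypothetical; the source does not state how grades are coded on the answer sheet.

```python
GRADE_WEIGHT = 2        # a priori weight for each course-grade variable
WORK_SAMPLE_WEIGHT = 1  # a priori weight for number of items right

def prognosis_total(past_grades, predicted_grade, work_sample_right):
    """Weighted composite of five grade variables and the work-sample score.

    past_grades: four student-reported past course grades (numeric coding assumed).
    predicted_grade: the student's own predicted grade in the course.
    work_sample_right: number of work-sample items answered correctly.
    """
    if len(past_grades) != 4:
        raise ValueError("expected four past course grades")
    grade_part = GRADE_WEIGHT * (sum(past_grades) + predicted_grade)
    return grade_part + WORK_SAMPLE_WEIGHT * work_sample_right

# Hypothetical student: grades coded 4, 3, 3, 2; predicts 3; 40 items right.
print(prognosis_total([4, 3, 3, 2], 3, 40))
```

Cross-validation then asks whether a composite built with these fixed weights correlates with the criteria about as well as one built with regression weights estimated afresh.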

The purpose of the investigation reported here was to determine 
the appropriateness of the assigned weights in predicting similar 
criteria for new samples of algebra and geometry students. 

Method. In June, 1968, 335 eighth-grade students took the Or- 
leans-Hanna Algebra Prognosis Test and 331 ninth-grade algebra 
students took the Orleans-Hanna Geometry Prognosis Test. The 
school system cooperating in this study had not been included in 
the original validation sample. In January, 1969, the Mid-Year 
Algebra Test and the Mid-Year Geometry Test were given to those 
students who had enrolled in algebra or geometry. At the same time, 
algebra and geometry teachers, without access to test scores, re- 
ported mid-year grades for their students. In June, 1969, algebra 
students were given the Lankton First-Year Algebra Test and ge-
ometry students were given the Howell Geometry Test. Final mathe-
matics grades were collected independently of the achievement test
scores. Results of the prognosis tests were not released to school
personnel until all criterion data had been collected.

Summary statistics, and zero-order and multiple correlation coef- 
ficients are presented in Tables 1 and 2 for the algebra and geometry 
cross-validation samples, respectively. 


TABLE 1
Correlations of the Orleans-Hanna Algebra Prognosis Test with Four Criteria of
Success

            Prognosis Test
       Total      Sample Taking    Mid-Year    Mid-Year    Final     Final
       Sample     Algebra          Grade*      Test        Grade*    Test

R 72 783 70 77 

3 .78 .73 78 
AE wa HE 193 190 hos 
SD 18.7 151 e 24.6 2.6 


С : 8.5 1.2 10.1 
'"A-&B-ARCIALDILENR wo. i 


            Prognosis Test
       Total      Sample Taking    Mid-Year    Mid-Year
       Sample     Geometry         Grade*      Test

* A = 4, B = 3, C = 2, D = 1, E/F = 0.


Results and conclusions. Examination of Table 1 reveals that
with the assigned a priori weights, each of the zero-order correla-
tions (r) between the Algebra Prognosis Test and each of the four


re optimally weighted, to make further refinement unnecessary
(Bligh, Lenke, Hanna, 1969). Similarly, Table 2 indicates that
very little predictive efficiency is lost when the assigned weights
are applied in scoring the Geometry Prognosis Test. It appears,
therefore, that the weights deemed appropriate for the original val-
idation sample and recommended as part of the standard scoring
procedure are equally appropriate for the cross-validation sample.
One can further conclude that these weights will be equally appro- 
priate for other samples of students within the same population. 
A COMPARISON OF THE D-48 TEST AND THE OTIS
QUICK SCORE FOR HIGH SCHOOL DROPOUTS 


BRAD S. CHISSOM AND RALPH LIGHTSEY
Georgia Southern College 


THE D-48 (Dominoes) Test has been used in its present form for
a number of subject populations. The test, consisting of a series of
nonverbal problems that use dominoes as the item format, is de- 
scribed as a nonverbal analogies test measuring general intelligence 
(Black, 1961). In a study of college students, Boyd and Ward 
(1967) obtained a correlation of .57 between the D-48 and the 
Otis Quick Score (Gamma Form FM). 

The internal consistency (KR) reliability for the D-48 test 
was reported in the same study as .85. Additional studies across age 
levels and cultures revealed the rank-order of the item difficulties 
to be similar. Rafi (1967), Gough and Domino (1963), and Welsh 
(1967) all cited similar rank orders of the item difficulties despite 
differences in the age and cultural background of the subjects. Us- 
ing a group of non-college French males, Pasquay and Doutrepont 
(1956) reported results on the D-48 with subjects similar to the 
ones tested in this study.

Purpose and method. This study was designed to examine two 
measures of intelligence, the D-48 and the Otis Quick Score (Otis,
1954), when used with a population of male high school dropouts. 
The subjects were full-time enlisted personnel enrolled in a General 
Educational Development course conducted by the United States 
Army. Sixty-one male subjects, whose permanent homes represented
28 states, were included in the study. The two tests, the D-48 and 
the Otis Quick Score, were administered on two successive days 
during the regular classroom instructional period. The specific objec- 
tives of the study were: (a) to obtain a correlation between D-48 
scores and scores on the Otis as an estimate of concurrent validity,
(b) to compare the rank-order of the item difficulties obtained with 
the rank-orders reported in other studies using different subject 
populations, and (c) to make a comparison of the mean score ob- 
tained for the D-48 with the mean score obtained from a compar- 
able group of subjects with a different cultural background. 

Results and discussion. Reliabilities for the D-48 and the Otis 
were calculated by the odd-even, split-half method increased by 
the Spearman-Brown Formula. The reliabilities were .92 for the 
D-48 Test, and .88 for the Otis. The KR reliability was .85 for
the D-48. 
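
The Spearman-Brown step-up applied to the odd-even correlations can be sketched as below. The function names are hypothetical, and the half-test correlations the code recovers are only implied values, since the source reports the stepped-up reliabilities alone.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Reliability of a test lengthened by length_factor, given r for the shorter form."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)

def half_test_r(r_full):
    """Invert the doubling case: half-test correlation implied by a stepped-up value."""
    return r_full / (2.0 - r_full)

# Stepped-up reliabilities reported in the text.
d48_half = half_test_r(0.92)
otis_half = half_test_r(0.88)
print(round(d48_half, 3), round(otis_half, 3))
```

On this reading, the odd and even halves of the D-48 would have correlated roughly .85 before the correction was applied.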

The correlation between the D-48 and the Otis was .27. Compar- 
ing this to the correlation of .57 reported by Boyd and Ward (1967) 
in their study with college students of the same age, the verbal abil- 
ity called for in the Otis probably accounts for the decrease. The 
decrease in the relationship between the tests would seem to support 
the idea that the D-48 assesses nonverbal intelligence. 

Comparisons of the rank-orders of the item difficulties were made 
between the rank-order obtained from the results of this study, 
the Welsh (1967) data for gifted high school students, and the 
Gough and Domino (1963) data for fifth and sixth grade pupils. 
The comparisons were made using the Spearman rank-difference 
correlation, and the data are shown in Table 1. The results agree 
with those reported by Rafi (1967), in which the correlation be- 
tween the Gough and Domino difficulty ranks and difficulty ranks 
obtained from a sample of Lebanese men was .95. This evidence of 
the magnitude of the relationship between the item difficulty ranks 
indicates the comparability of the D-48 across age levels and cul- 
tural backgrounds. 
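
The Spearman rank-difference correlation used for these comparisons can be sketched as follows. The function name and the example ranks are hypothetical and are not the D-48 item ranks; the simple formula shown assumes no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank-difference correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Assumes both inputs are complete rankings without ties.
    """
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length (n >= 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical difficulty ranks of five items in two groups.
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 1, 3, 5, 4]
rho = spearman_rho(group_1, group_2)
print(round(rho, 2))
```

A coefficient near 1 means the items fall in nearly the same order of difficulty in both groups, which is the sense in which the rank orders are said to be comparable.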

Finally, the D-48 mean score of 19.03 for the 61 dropouts (see 
Table 2) was not significantly different from a mean of 19.78 for 
a sample of non-college French males, ages 20-25 (Pasquay and


TABLE 1
Intercorrelations for Three Item Difficulty Rank-Orders

Source of Item Ranks                              1    2    3
1. High School Dropouts (Chi
2. High School Gifted (Welsh)
3. Fifth and Sixth Graders (Gough-Domino)


TABLE 2
Means and Standard Deviations
(N = 61)
Variable            Mean     SD
Age                 21.62    3.82
Grade Completed      9.51    1.48
D-48 Raw Score      19.03    6.33
Otis Raw Score      29.43    8.75


Doutrepont, 1956). This result is additional evidence of the cross- 
cultural nature of the D-48 Test. 

Summary. The D-48 and Otis Quick Score tests designed as mea- 
sures of intelligence were administered to a group of male high school 
dropouts. Results indicated that the relationship between the two
tests is weaker for individuals who, on the average, have completed
between a ninth- and tenth-grade level than for college students. Further
results indicated that the relative levels of D-48 test item difficulties
are comparable across age levels and for subjects with diverse
cultural backgrounds.
VALIDITIES OF THE OTIS-LENNON MENTAL ABILITY
TEST, THE LORGE-THORNDIKE INTELLIGENCE 
Introduction. The Otis-Lennon Mental Ability Test (Otis and
Lennon, 1967a, 1967b) could be a major breakthrough in the field 
of measuring academic potential. No doubt a great deal of ongoing 
and future educational and psychological research will make use of 
the improved Otis instrument. However, as Groteleuschen (1969) 
pointed out, validity studies with the new test are scarce. The


School District of Cheltenham Township, and a predoctoral USOE fellowship 
in educational research granted to J. R. McGowan at Lehigh University, 
Bethlehem, Pa. However, the opinions expressed herein do not
present study not only provides some initial information on the pre-
dictive validity of the 1967 Otis-Lennon test relative to the Lorge-
Thorndike instrument, but also yields basic data on the factor ana-
lytic structure of the battery of tests. The relative predictive and
construct validities of the Otis-Lennon, Lorge-Thorndike, and Met- 
ropolitan Readiness tests were investigated by several multivariate 
analyses: canonical correlation, factor analysis, and stepwise re- 
gression. 

Procedure. The pupils selected for this study comprised the total 
enrollments of the second and fourth grades for the academic year 
1968-1969 of a large suburban public school district in the Greater 
Philadelphia Area. The district is distinctly of upper socio-economic 
class. The majority of pupils are college-bound. Descriptive data 
(including the intercorrelations of all variables at second and 
fourth grades) for these pupils can be found in Table 1. 

The second grade pupils were administered the Lorge-Thorndike 
Intelligence Test (L-T IT) (Level A, Form 2) in October, 1968, and 
the Otis-Lennon Mental Ability Test (O-L MAT) (Elementary I 
Level Form J) in November, 1968. The Metropolitan Readiness 
Test (MRT) (Form A) had been given to 322 of the current group 
of 386 second grade pupils in April, 1967 (the end of their kinder- 
garten enrollment). 

The fourth grade pupils were administered the L-T IT (Level
B, Form 1) in October, 1968, and the O-L MAT (Elementary II
Level, Form J) in November, 1968. MRT (Form A) had been given
to 316 of the 469 present fourth grade pupils in April, 1965 (the
end of their kindergarten enrollment).

The main set of academic achievement criteria for the second 
grade pupils was the Stanford Achievement Test (SAT) (Primary 
II Battery, Form W), and for the fourth grade pupils was the SAT 
(Intermediate I Battery, Form X). Both tests were given in April, 
1969. Besides using standardized SAT criteria, it was considered 
feasible and interesting to have the teachers of both second and 
fourth graders rate their pupils on academic ability relative to the 
school district as a whole (second or fourth) in specific areas. In
each academic area selected, the rating was accomplished by a five- 
choice continuum. The code was: 5, outstanding; 4, good; 3, aver- 
age; 2, below average; and 1, very limited. For second grade, two 
teacher criteria employed were termed "Reading Comprehension," 
and “Arithmetic Computation.” For fourth grade, four teacher cri-
teria used were termed “Reading Comprehension,” “Arithmetic
Computation,” “Arithmetic Concepts,” and “Arithmetic Applica- 
tion.” The use of teacher rating (TR) in this study is similar to that 
of Kim, Anderson, and Bashaw (1968) in their canonical correla- 
tion study. 

Analyses. Several sets of analyses were undertaken: canonical
correlation, principal-components factor analysis, and stepwise re-
gression. The BMD06M canonical correlation technique (Dixon,
1967, pp. 207-214) was selected to investigate the overall patterns
of significant relationship and to suggest the factor analytic struc-
ture of the total battery of predictors and criteria. However, the
detailed factor analysis was left for the BMD03M principal-com-
ponents factor analysis (Dixon, 1967, pp. 169-184). Once the fac-
torial composition of tests had been established, the final stage of
the validity analyses was to assess the relative predictive powers of
the three tests in question by the BMD02R stepwise regression pro-
gram (Dixon, 1967, pp. 233-257d.).

The authors were particularly interested, from a methodological 
viewpoint, in how the canonical correlational analysis compares 
with principal-components factor analysis in the factorial-structure 
sense. 

Many different philosophies underlie the interpretation of canon-
ical correlation. Dunteman and Bailey (1967) discussed the differ-
ences between factor analysis and canonical correlation, while Ohn-
macht and Olson (1968) emphasized the similarities. Maxwell
(1961) interpreted canonical vector weights and compared them 
with factor analytic weights. By applying both principal-compo- 
nents factor analysis and canonical correlation to the same set of 
data, the authors hoped to assess the similarities and differences 
in the resulting factor-analytic structures.
еа ve validity. Table 2 presents the canonical cor- 
т weights for second and fourth grades. Bartlett’s Test 
TABLE 2


" 2 Р ES -19 
O-L MAT Tot .57 1.18 ‚36 .39 -.49 
L-T IT Verb AT 0 
L-T IT Nonv 13 . 
L-T IT Tot 20 —.45 —1.21 
MRT 41 —.92 .70 А -1.06 
TR Rdg Comp 29 —.0%6 .21 Q8 —.00 72 -.90 
TR Arit Comp 08 —.40 04 о —.% B 
TR Arit Conc -1.10 -.02 51 —.% .40 
TR Arit App 19 Т жын! uv 
SA'T Word Mean —.0 -.M —.16 30 -.99 -—.9 -H 
SAT Para Mean .22 -.295  —.18 ‚15 .24 E ‚70 
SAT Sci SS 13 .00 —.65 
SAT Spel —.20 —.90 4 .0 —.62 06 КП 
SAT Word St S 18 16 42 ‚05 .50 a 8 
SAT Lang 17 21 .39 ‚26 .21 о = 
SAT Arit Comp -.п 62 -.20 =—.14 —.16 д -.0 
SAT Arit Cone 8  .72 ف‎ .M ‚в —.36 -1.0 
SAT Arit App .12 E 00 1.13 


Bartlett's Test of Wilks’ Lambda Criterion shows that for second grade, the first canonical relationship
is significant at the .01 level, while for fourth grade, the first two canonical relationships are significant at the 005


of Wilks’ Lambda Criterion (Cooley and Lohnes, 1962, p. 37) was 
applied.
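
Bartlett's chi-square approximation for Wilks' lambda, in the form commonly given in multivariate texts, can be sketched as below. The function name and the input values are hypothetical; the authors' own computations were carried out with the BMD program, and the full set of canonical correlations for this study is not reproduced here.

```python
import math

def bartlett_chi_square(canonical_rs, n, p, q):
    """Overall test that all canonical correlations are zero.

    Lambda = product of (1 - Rc^2) over the canonical correlations.
    chi2   = -[n - 1 - (p + q + 1) / 2] * ln(Lambda), with p * q degrees of freedom,
    where n is the sample size and p, q are the numbers of predictors and criteria.
    """
    wilks_lambda = 1.0
    for r in canonical_rs:
        wilks_lambda *= (1.0 - r * r)
    multiplier = n - 1 - (p + q + 1) / 2.0
    chi2 = -multiplier * math.log(wilks_lambda)
    return wilks_lambda, chi2, p * q

# Hypothetical example: two predictors, three criteria, 100 pupils.
lam, chi2, df = bartlett_chi_square([0.6, 0.3], n=100, p=2, q=3)
print(round(lam, 4), round(chi2, 2), df)
```

The resulting chi-square is referred to a chi-square table with the stated degrees of freedom; successive roots are usually tested by dropping the largest correlation and repeating the calculation with reduced degrees of freedom.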

Findings for fourth grade pupils. Out of the four canonical rela-
tionships given for fourth grade, only the first two are significant
at the .05 level. The first relationship (Rc = .88), or factor, that
underlies both predictor and criterion variables apparently empha-
sizes the verbal composition of both the O-L MAT and L-T IT (ver-
bal) tests in terms of dimensions common to the SAT: Word Mean-
ing and to a lesser extent, SAT: Language. All these variables have
positive weights. Thus, the first canonical variate is fairly easily
interpreted. The second variate (Rc = .45) deals only with the ver-
bal and nonverbal components of the L-T IT as predictors; in other
words, the construct validity of the intelligence scores of the L-T IT
may be somewhat clarified. One sees that the nonverbal component
is most highly related to TR: Arithmetic Concepts, SAT: Word
Study Skills, and SAT: Arithmetic Concepts. One should note, how-
ever, that the SAT: Word Study Skills does appear to be out of
place with respect to the part of the dichotomy in this canonical
variate that is presumably “nonverbal.” All the nonverbal variables
have positive weights. The verbal component of the second canon-
ical variate consists of the negatively weighted variables of L-T
IT Verbal, TR: Reading Comprehension, SAT: Word Meaning, and 
SAT: Spelling. 

Next, the canonical variate analysis involving the two significant 
dimensions for fourth grade pupils is compared with the correspond-
ing principal-components analysis also yielding two factors. The
latter results are given in Table 3. When only eigenvalues greater 
than one were used, two factors were extracted. The first factor shows 
that, in the total set of variables, the O-L MAT, L-T IT, and MRT 
are loaded heavily on the standardized SAT verbal criteria; this 
result is in accord with the first canonical variate above. However, 
the second factor points out one of the basic differences between 
canonical correlation and factor analysis, as discussed by Dunte- 
man and Bailey (1967). In Table 3, one sees that the second factor 
loads primarily on numerical criteria, and only modestly on the pre- 
dictor tests in question. Thus, while canonical correlation “forced” 
the predictor tests to have high weights on each variate, the prin- 
cipal-components factor analysis allowed the numerical variables 
to cluster together (“internal factor analysis”) without the inclu- 
sion of the predictor tests. 

Findings for second grade pupils. Turning to second grade, one
sees the canonical variate weights in Table 2. The first variate (Rc
= .81) emphasizes the loading of the O-L MAT and MRT on a
dimension associated with the SAT: Arithmetic Concepts, SAT:
Paragraph Meaning, and TR: Reading Comprehension criteria. If
one is willing to believe that arithmetic concepts require verbal
as well as numerical abilities, then the first variate can be termed
a verbal factor underlying both predictors and criteria. In the
second canonical variate (Rc = .27), the O-L MAT is weighted
positively along with the SAT: Arithmetic Computation and SAT:
Arithmetic Concepts, while the MRT and, to a lesser extent, the
L-T IT (total scores) are weighted negatively in association with
SAT: Spelling and TR: Arithmetic Computation. Using the rela-
tive weights on the criteria in the second variate, one sees that the
O-L MAT is more closely related to the numerical criteria than to
the verbal criteria, while the opposite is true for both the MRT
and the L-T IT. However, one must use caution in interpreting the
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TABLE 3 
Principal-Components Factor Analyses* 
Бокой GAR? Fourth Grade 


Loading on Loading on Commu- Loading on Loading on Commu- 
Factor A Factor B nality — FaetorA FactorB вау 


79 83 32 50 
а зз T 
“ [ 
T Tot 22 76 63 
8 44 59 54 51 2 33 
Comp 73 46 75 64 58 74 
it Comp 70 45 69 23 90 7 
37 86 7 
Арр 37 86 8% 
d Mean 74 46 75 87 23 80 
Mean 71 52 78 84 31 50 
20 75 61 
87 13 77 70 30 5% 
78 30 70 71 35 63 
74 44 74 78 43 7» 
67 21 50 39 62 ы 
53 64 70 64 55 п 
61 52 “ 
that the ... .01 level.

In Table 3, one sees the principal-components analysis for the
total battery of tests in second grade. The first factor is loaded
mostly with verbal criteria, with slightly lower loading for the two
numerical criteria; the predictor tests have relatively low loadings.
In the second factor, the O-L MAT and L-T IT load most heavily
on SAT:Science and Social Studies. The factor-analytic picture is
not so clearly defined for the second grade as for the fourth grade
sample.

Results—Predictive validity. Several separate stepwise regres- 
sion analyses were conducted in selected verbal and numerical 
areas. Table 4 presents the standardized regression equation re- 


Prediction of SAT achievement measures. In general, at the
second-grade level, the O-L MAT appeared to be a better predictor of
standardized verbal and numerical achievement test performance
than was the L-T IT. However, despite the large proportion of
variance in any one particular criterion measure explained by the
O-L MAT alone, the MRT usually accounted for a roughly similar
amount. At the fourth-grade level, the regression analyses become
somewhat more refined because of the distinction in the L-T IT of
verbal and nonverbal intelligence. For the SAT criteria, again the
O-L MAT appeared to be a slightly more valid predictor than the
L-T IT, although the differences were not nearly so marked as in
second grade.

Specifically, for the SAT Paragraph Meaning Test, the best
single predictor was the O-L MAT in both second and fourth
grades (R2 = .61 and R4 = .75).4 For the SAT Arithmetic Compu-
tation Test, again the best single predictor was the O-L MAT in
both grades (R2 = .44 and R4 = .49). In fourth grade, the O-L
MAT was the best single predictor of the SAT Arithmetic Con-
cepts Test (R4 = .68) and of the SAT Arithmetic Applications
Test (R4 = .66).
Prediction of TR achievement measures. Also of interest to the
predictive validity aspect of the study was the relationship of the
five-point teacher ratings in selected verbal and numerical areas
to the three predictor tests. The best single predictor of TR:Reading
Comprehension was the MRT in second grade (R2 = .59) and the
O-L MAT in fourth grade (R4 = .71). For TR:Arithmetic Compu-
tation, the best single predictor in second grade was the O-L MAT
(R2 = .56) and in fourth grade was the L-T IT Nonverbal test
(R4 = .53). Finally, the L-T IT Verbal test was the best single
predictor for fourth-grade TR:Arithmetic Concepts (R4 = .61) and
for fourth-grade TR:Arithmetic Applications (R4 = .62). Clearly,
in the case of the five-point teacher ratings, the O-L MAT does not
have a virtual monopoly on the highest predictive validities. Indeed,
at the fourth-grade level, the L-T IT Verbal and Nonverbal tests
seem to be superior to the O-L MAT.

Just how much confidence can be placed in the teacher ratings
can be partially answered by their relative canonical variate
weights in Table 2. The high teacher rating canonical vector 
weights might be reflecting at the elementary school level the 
oft-repeated finding that teacher-given grades are more valid 
predictors of later achievement than are standardized tests. 


4 All predictive validities are given in terms of the multiple correlation
coefficient at the first stage of the stepwise model building; that is, the
correlation involves only the best single predictor and the criterion.

Summary. Canonical correlation and principal-components
factor analysis were employed to study the factorial construct
validity of the O-L MAT, L-T IT, and MRT. The results demon-
strated some similarities and differences between the two analytical
approaches to construct validation.

Further, stepwise multiple regression was used to establish the
relative predictive validities of the three tests in selected verbal
and numerical areas. In brief, the O-L MAT appears to be at least
as effective a predictor of verbal and numerical achievement as
measured by the SAT and TR as are the L-T IT and MRT. None-
theless, the reader is cautioned to bear in mind the two major re-
strictions on the generalizability of the study: (a) the pupils were
definitely of above-average ability and very much verbally oriented;
and (b) the O-L MAT and SAT are produced by the same pub-
lisher.5

THE RELATIONSHIP OF AVERAGE SCORES ON
INTELLIGENCE AND READING TESTS TO
PERCENTAGES OF MINORITY GROUP STUDENTS IN
ELEMENTARY SCHOOLS AND HIGH SCHOOLS
IN A LARGE METROPOLITAN AREA


WILLIAM B. MICHAEL, ROBERT A. SMITH, AND YOUNG B. LEE
University of Southern California 


ON September 30, 1969, the Los Angeles Times released for each of
435 elementary schools and 47 high schools in the Los Angeles Unified 
School District both the mean IQ scores and the mean grade place- 
ment scores in reading from the state mandated tests that had been 
administered during November 1968 as well as the percentages of mi- 
nority students. Pupils in the elementary schools were given the 
Lorge-Thorndike Intelligence Tests, Form 1, Level D, Verbal Battery 
and the Stanford Achievement Test: Reading Form W, Intermediate 
II (level). Students in the tenth grade were administered the Lorge- 
Thorndike Intelligence Tests, Form 1, Level G, Verbal Battery and 
the Test of Academic Progress, Form 1, Reading Section. 

Purpose. It was the purpose of this investigation to report for the 
two populations of 435 elementary schools with pupils tested in the 
sixth grade and 47 secondary schools with students tested in the tenth 
grade the degree of correlation between (a) average IQ scores and 
percentages of minority students, (b) average grade placement scores 
in reading and percentages of minority students, and (c) average IQ 
Scores and average grade placement scores in reading. Such informa- 
tion might be expected to furnish a partial basis for generating a num- 
ber of testable hypotheses regarding relationships between level of 
affluence or economic opportunity in a school community on the one
hand and level of measurable scholastic aptitude or scholastic attain-
ment on the other.

Findings. For the two populations of elementary and secondary
schools, respectively, the correlation coefficients between each of the
three pairings of variables were as follows: (a) IQ scores and per-
centages of minority enrollees, -.826 and -.985; (b) reading scores
and percentages of minority members, -.824 and -.890; and (c)
reading scores and IQ scores, +.962 and +.985. Thus one could predict,
particularly at the high school level, with a relatively high degree of 
accuracy the average scholastic aptitude scores and average reading 
performance scores in a school from knowledge of the percentage of 
enrollees that belong to minority groups in a given school. It should be 
remembered, however, that it is not uncommon to find that correla- 
tions among means of several groups are considerably higher than 
those found among the individual measures within the groups. There- 
fore, any attempt to predict individual IQ or reading scores from per- 
centage of minority membership in a school would definitely not be 
warranted. The reader is left to draw his own conclusions and infer- 
ences from the data presented. 
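
The caution about correlations among group means can be illustrated with a small simulation. The sketch below uses purely synthetic data, not the Los Angeles figures; the function names, the number of schools, and the noise levels are assumptions chosen only to show how averaging within schools can raise a weak pupil-level correlation to a very high school-level one.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_schools=50, pupils_per_school=200, seed=0):
    """Return (r among individual pupils, r among school means).

    Synthetic data only: each school has one shared level that shifts both
    scores, and every pupil adds large independent noise to each score.
    """
    rng = random.Random(seed)
    all_x, all_y, mean_x, mean_y = [], [], [], []
    for _ in range(n_schools):
        school_level = rng.gauss(0, 1)  # hypothetical school-wide factor
        xs = [school_level + rng.gauss(0, 3) for _ in range(pupils_per_school)]
        ys = [school_level + rng.gauss(0, 3) for _ in range(pupils_per_school)]
        all_x.extend(xs)
        all_y.extend(ys)
        mean_x.append(mean(xs))
        mean_y.append(mean(ys))
    return pearson(all_x, all_y), pearson(mean_x, mean_y)


r_individual, r_means = simulate()
print(f"r among pupils:       {r_individual:.2f}")
print(f"r among school means: {r_means:.2f}")
```

Averaging removes most of the pupil-level noise, so the school means track the shared school factor closely even though individual scores do not. This is why a high correlation among means gives no warrant for predicting an individual pupil's score.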
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THE DEVELOPMENT OF A MEASURE OF 
VOCATIONAL MATURITY¹


BERT W. WESTBROOK, JOSEPH W. PARRY-HILL, JR.,
AND ROGER W. WOODBURY


North Carolina State University


VOCATIONAL maturity has come into fairly wide use as a variable
presumably important in the vocational adjustment of youth (Super,
Crites, Hummel, Moser, Overstreet, and Warnath, 1957; Super and
Overstreet, 1960; Gribbons and Lohnes, 1968; Crites, 1965). Research
to date testifies to the importance of the concept, but its use is re-
stricted by the lack of a practical, reliable, and valid instrument for
measuring it. Both the Indices of Vocational Maturity (IVM) util-
ized in the Career Pattern Study (Super, et al., 1957; Super and Over-
street, 1960) and the Readiness for Vocational Planning (RVP)
scales developed by Gribbons and Lohnes (1968) employed interview
approaches which require the use of scoring manuals for assessing
levels of vocational maturity. Collecting the data is time-consuming,
and scoring requires a great deal of time from highly qualified person-
nel. The purpose of this report is to describe the development of the
Vocational Maturity Scale (VMS) (Westbrook, 1970), an instrument
designed to provide an objective measure of an individual's general
level of vocational maturity.

A review of the literature dealing with vocational maturity sug-
gested the following cognitive variables for which 200 multiple-choice
items were constructed: (1) Related Occupations, (2) Education Re-
quired, (3) Duties, (4) Fields of Work, (5) Vocational Goal Selection,


¹ This paper was supported by the research program of the Center for
Occupational Education, located at North Carolina State University at Raleigh,
North Carolina, in cooperation with the Division of Adult and Vocational
Research, Bureau of Research, U. S. Office of Education.

(6) Vocational Problem-Solving, (7) Occupational Trends, (8) Apti-
tudes Required, (9) Career Alternatives, (10) Course Selection, (11)
Curriculum Selection, (12) Interests, (13) Job Success Factors, (14)
Vocational Planning, (15) Abilities, and (16) Values. Form A of the
VMS comprised 100 items, five in each of the 16 areas except
Education Required and Duties, which contained 15 items each.
Form B contained an identical number of items in each area and was
intended to be equivalent to Form A. Many of the items on Form A
and Form B had been administered earlier to pupils in grades 6 (N =
1019), 7 (N = 2207), and 8 (N = 2044) for the purpose of obtaining
item analysis data which were used as a basis for the revision of items.

Form A and Form B were administered two weeks apart to a sam-
ple of 307 ninth-grade pupils in one school. Form A had a mean of
62.03, a standard deviation of 10.20, a KR-20 of .83, and a correlation
of .60 with mental ability. Form B had a mean of 67.89, a standard
deviation of 12.32, a KR-20 of .89, and a correlation with mental abil-
ity of .59. The correlation between Form A and Form B was .74.
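
For readers unfamiliar with the KR-20 coefficient reported above, the following is a minimal sketch of the usual Kuder-Richardson formula 20 computation for dichotomously scored items. The function name and the toy response matrix are illustrative assumptions, and the population (divide-by-N) variance is used here; some texts divide by N - 1, which changes the result slightly.

```python
def kr20(responses):
    """KR-20 reliability for a persons-by-items matrix of 0/1 item scores.

    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
    where p is the proportion passing an item and q = 1 - p.
    """
    n_persons = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    # Population variance of total scores (an assumption; see lead-in).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_persons
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Toy data: four examinees, three items (hypothetical, not VMS data).
toy = [[1, 1, 1],
       [1, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
print(round(kr20(toy), 2))  # 0.75
```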

To remove the unwanted factor of mental ability, each item on 
both forms was correlated with total scores and with scores on mental 
ability (Otis-Lennon). Then, the 50 items on each form having rela- 
tively high correlations with total scores and relatively low correla- 
tions with mental ability were identified. Each pupil’s answer sheet 
was rescored on the 50 selected items on Form A and the 50 selected 
items on Form B. 
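
The report does not give the exact rule used to balance "relatively high" item-total correlations against "relatively low" correlations with mental ability. The sketch below shows one possible rule, assumed here purely for illustration: rank items by the difference between the two correlations and keep the top n. The function name, the difference criterion, and the toy correlations are all hypothetical.

```python
def select_items(r_total, r_ability, n_keep):
    """Return indices of the n_keep items with the largest
    (item-total r minus item-ability r). Assumed rule, not the authors'."""
    margin = [rt - ra for rt, ra in zip(r_total, r_ability)]
    ranked = sorted(range(len(margin)), key=lambda i: margin[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical correlations for four items.
r_total = [0.50, 0.40, 0.60, 0.20]    # item with total score
r_ability = [0.40, 0.10, 0.20, 0.10]  # item with mental ability
print(select_items(r_total, r_ability, 2))  # [1, 2]
```

Rescoring each answer sheet then amounts to summing only the retained item scores.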

The short version of Form A (50 items) had a mean of 36.02, a stan-
dard deviation of 5.92, a KR-20 of .71, and a correlation of .54 with
mental ability. The short version of Form B (50 items) had a mean of
37.95, a standard deviation of 7.21, a KR-20 of .85, and a correlation
of .49 with mental ability. The correlation between the short versions
of Form A and Form B (two-week interval) was found to be .65.

To obtain data on the concurrent validity of the VMS, an indepen-
dent sample of 28 pupils was administered both Form A of the VMS
and Gribbons’ RVP, a vocational maturity instrument known to have
some predictive validity. The RVP scores (average of two indepen-
dent scorers) correlated .76 with scores on the VMS. The predictive
validity of the VMS is currently being examined by determining
whether high scorers make more appropriate vocational choices than
low scorers.

IN a previous article three of the writers (Zimmerman, Michael,
and Michael, 1970) described the results of a factorial study underly- 
ing the development of an experimental instrument Study Attitudes 
and Methods Survey (SAMS) Test (Michael, Michael, and Zimmer- 
man, 1969). The purpose of this paper is to report on additional fac- 
tor-analytic studies that have been carried out to refine the dimen- 
sions of the instrument. 

Procedure. Both the experimental form of the SAMS consisting of 
167 previously analyzed items and a supplementary inventory con- 
taining 67 new items were administered to a sample of 168 students in 
introductory classes at Los Angeles City College. The supplementary 
test form was devised to furnish a means for adding items that would 
yield relatively more homogeneous and reliable scales of previously 
identified dimensions as well as a means for separating the persist-
ence-conformity factor into two factors of academic drive and con-
formity. Subsequent to a factor analysis of the intercorrelations of
the 167 items and a separate factor analysis of the intercorrelations
of the 67 items in the supplementary instrument, items that exhibited
low communalities and/or highly complex factorial structure were 
eliminated. One hundred and forty-four items of the original 167 and 
43 items of the 67 new ones were subjected to a varimax factor analy- 
sis (Kaiser, 1959). Two separate solutions were obtained involving 
the rotation of eight and of ten principal component factors. 

Results. In each of the two rotated factor analyses eight identified 
dimensions involving loadings above .35 on at least six item variables 
were described as follows: (a) academic drive, a form of achievement 
motivation involving to a large extent persevering behavior in rela- 
tion to earning high marks (extrinsic motivation); (b) conformity 
(realizing teachers’ expectations or meeting institutional require-
ments in an exacting manner); (c) academic interest or learning af-
fect-satisfaction (replication of a previously found factor described as
love of learning for its own sake, that is, intrinsic motivation); (d) anxiety,
or self-depreciation of one's ability to meet academic requirements in-
cluding adequate performance on examinations; (e) alienation to-
ward educational institutions and toward teachers and administrators
as a group: a generally critical attitude toward how well the school
and individuals in the power structure meet students’ perceived needs 
and expectations; (f) methodical and systematic approaches to 
study in contrast to disorganized and slipshod techniques and poor 
planning; (g) positive orientation toward teachers—a liking of and 
identification with the individual teacher and with selected academic 
characteristics of the school environment; and (h) manipulation in- 
volving political shrewdness or savoir faire on the part of students in 
exerting power over instructors to gain their own ends, as if at the ex- 
pense of the prestige of the teacher. 

Discussion. On the basis of the findings from the factor analytic
solutions, editorial revisions in certain items have been made, primar-
ily to eliminate factorial complexity whenever possible. In particular,
certain items representing a positive orientation toward the teacher
were related factorially and logically to the academic interest factor
and thus were merged with it. The new instrument has been refined
to yield seven relatively independent dimensions consisting of 20 to
25 items in each of the following factors: (a) academic drive, (b)
conformity, (c) academic interest, (d) anxiety, (e) alienation, (f)
method and system, and (g) manipulation. New administrations of
the instrument are being carried out with junior college and high


of the scales relative to measures of academic performance.

A FACTOR ANALYSIS OF THE CPI AND EPI


ROBERT D. ABBOTT 
California State College, Fullerton 


is scored in more than one scale. 
Both the CPI and the EPI were designed to describe personalities


lated, and still others may be relatively independent of other traits.
Regardless of the number of scales contained in an inventory, it is of
importance to know the number of independent personality dimen-
sions measured by the scales.

One method of determining the number of independent dimensions
measured by a battery of tests or scales is to factor analyze the inter- 
correlations of the tests or scales. 

Method. Scores on the CPI and the EPI were available for 171 fe- 
male and 115 male students who participated in a test research pro- 
ject. Also available were scores on Edwards’ (1957) Social Desira- 
bility (SD) scale, Welsh's (1956) R scale, the Marlowe-Crowne 
(1960) scale, and Wiggins’ (1959) Sd scale, marker scales which 
have been found useful in identifying factors obtained in factor 
analyses of personality scales (Edwards and Walsh, 1964). 

Scores on the 75 scales were intercorrelated and factor analyzed by 
the method of principal components. Fourteen factors with eigenval- 
ues greater than 1 were extracted and rotated using Kaiser's Vari- 
max. The 14 factors accounted for 71 percent of the total variance. 
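
The extraction rule used here (retain components with eigenvalues greater than 1) and the reported percentage of variance can be sketched as follows. Because the eigenvalues of a correlation matrix sum to the number of variables, 14 factors accounting for 71 percent of the variance of 75 scales implies that their eigenvalues sum to roughly 0.71 × 75 ≈ 53. The function name and the toy eigenvalues below are illustrative assumptions, not values from the study.

```python
def kaiser_summary(eigenvalues):
    """Return (number of eigenvalues > 1, proportion of total variance
    carried by those retained components)."""
    retained = [ev for ev in eigenvalues if ev > 1.0]
    return len(retained), sum(retained) / sum(eigenvalues)


# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5).
toy_eigenvalues = [2.5, 1.5, 0.5, 0.25, 0.25]
n_factors, share = kaiser_summary(toy_eigenvalues)
print(n_factors, share)  # 2 0.8
```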

Results and discussion. Table 1 shows the EPI scales with the high-
est, positive or negative, correlation with each of the 18 CPI scales.2
The table also gives the correlations of the CPI scales with the SD 
scale and the corresponding correlation of the EPI scale with the SD 
scale. 

Table 2 lists the EPI, CPI, and marker scales with absolute load- 
ings of .40 or greater on each of the 14 factors. The first two factors 
obtained are essentially the same as the first two factors found when 
the intercorrelations of the CPI scales alone are factor analyzed by 
the method of principal components with iterated communalities 
(Nichols and Schnell, 1963). Both the CPI and the EPI contain 
scales which measure Factor I. 

With respect to Factor II, on which the SD scale has a loading of 
га, 10 of the 18 CPI scales and 2 of the 53 EPI scales have absolute 
loadings of .40 or greater on this factor. Factor scores on Factor II 
would appear to be measuring what Edwards (1970) has called the

TABLE 1
Highest Correlation between a CPI Scale and an EPI Scale and Their Correlations
with the SD Scale

                                  r with   r with                                    r with
CPI Scale                         SD       EPI scale   EPI Scale                     SD

Dominance (Do)                     36        76        Assumes Responsibility        ...
Capacity for Status (Cs)           42       -46        Shy                           -41
Sociability (Sy)                   46       -74        Shy                           -41
Social Presence (Sp)               46       -59        Shy                           -41
Self-acceptance (Sa)               31       -65        Shy                           -41
Well-being (Wb)                    76       -46        Feels Misunderstood           -52
Responsibility (Re)                30       -33        Self-centered                 -16
Socialization (So)                 28        50        Conforms                       06
Self-control (Sc)                  50       -50        Enjoys Being Center of
                                                         Attention                   -15
Tolerance (To)                     63       -55        Feels Misunderstood           -52
Good Impression (Gi)               55        59        Virtuous                       25
Communality (Cm)                   14       ...        ...                           ...
Ach. via Conformance (Ac)          58        53        Plans Work Efficiently        ...
Ach. via Independence (Ai)         37       -39        Anxious About Performance     -50
Intellectual Efficiency (Ie)       57       -44        Feels Misunderstood           -52
Psychological Mindedness (Py)      41       -41        Perfectionist                 -08
Flexibility (Fx)                   02       -58        Plans and Organizes Things     10
Femininity (Fe)                   -22        37        Makes Friends Easily           18


tendency to give socially desirable responses in self-description and
what Block (1965) has called ego-resilience. In developing the EPI
scales a deliberate attempt was made to minimize this personality
dimension.

On Factor III, two CPI scales and eight EPI scales have absolute
loadings of .40 or greater. This factor is obviously better represented
in the domain of the EPI scales than in the domain of the CPI scales.
With respect to Factors VII, VIII, and X, factors on which both CPI
and EPI scales have relatively small loadings, there is little basis for
choice between the scales in representing the factors.

There are eight factors, Factors IV, V, VI, IX, XI, XII, XIII, and
XIV, on which no CPI scale has an absolute loading of .40 or greater.
These dimensions of personality are, in other words, not represented
by the 18 scales in the CPI.

The EPI contains 2.5 times the number of items in the CPI and re- 
quires approximately 2.5 times as long to administer. The additional 
testing time required with the EPI also results in approximately 2.5 
times the number of dimensions of personality obtained with the CPI.

TABLE 2
CPI and EPI Scales with Absolute Loadings Greater Than .40 on the Fourteen
Rotated Factors

Factor I                                      Factor II
Self-acceptance                      81       Tolerance                          ...
Shy                                 -79       Well-being                         -82
Dominance                            79       Intellectual Efficiency             78
Sociability                          79       SD                                 ...
Assumes Responsibility               74       Self-control                       -71
Is a ...                             74       Ach. via Conformance               -65
Articulate                           73       Ach. via Independence              -62
Social Presence                      67       Responsibility                     -56
Self-confident                       64       Good Impression                    ...
Capacity for Status                  49       Socialization                      -50
Makes Friends Easily                 48       Psychological Mindedness           -50
Self-critical                       -45       Feels Misunderstood                 48
Enjoys Being the Center of           45       Capacity for Status                -46
  Attention                                   Self-critical                       42

Factor III                                    Factor IV
Is a Hardworker                      86       Avoids Arguments                    78
Is a Perfectionist                   81       Independent in His Opinions        ...
Persistent                           77       Easily Influenced                   70
Plans Work Efficiently               77       Worries About Making a Good         58
                                                Impression on Others
Plans and Organizes Things           72       Conforms                           ...
Motivated to Succeed                 72       Critical of Others                 ...
Flexibility                         -54       Cooperative                         43
Avoids Facing Problems              -52       Anxious About His Performance       42
Absentminded                        -47       Sensitive to Criticism             ...
Ach. via Conformance                 41

Factor V                                      Factor VI
Kind to Others                      -73       Dependent                           82
Helps Others                        -69       Talks About Himself                ...
... About His Possessions            59       Wants Sympathy                     ...
Considerate                         -97       Conceals His Feelings              ...
Feels Misunderstood                 ...

Factor VII                                    Factor VIII
Marlowe-Crowne                       75       Flexibility                        ...
Virtuous                             64       Psychological Mindedness           ...
Good Impression                      61       Communality                        ...
Wiggins’ Sd                          43       Cooperative                        -45
Communality                         -43       Wiggins’ Sd                        -44
Conforms                            ...

Factor IX                                     Factor X
Impressed by Status                  77       Femininity                          65
Desires ...tion                      76       Conceals His Feelings              ...

Factor XIII                                   Factor XIV
Interested in the Behavior of       -75       Active                             ...
  Others
Understands Himself                 -66       Seeks New Experiences              ...
...                                 -48

THE REDUCED SIZE ROD AND FRAME TEST
AS A MEASURE OF
PSYCHOLOGICAL DIFFERENTIATION 


TED NICKEL

Gover (1965) and Tyler (1965) both mentioned the need for a
more portable, less expensive instrument with which to assess the in- 
dividual’s position with regard to being field dependent (FD) (Wit- 
kin, Dyk, Faterson, Goodenough, and Karp, 1962). A particularly 
clear description of the technical difficulties an experimenter must 
surmount was given by Vaught (1965) in which he described the need 
to blindfold subjects between trials of the RFT in order to prevent 
their establishing visual cues of uprightness while the experimenter 
was recording results of the trial. 

An apparatus is described in this study which is similar to the rod 
and frame test (RFT) in Witkin and Asch’s 1948 study. Nickel’s Por- 
table Rod and Frame Test (N-RFT), which is simple and inexpensive 
to construct, has criterion validity parallel to that of the more cum- 
bersome and expensive full sized RFT. 

Purpose. This experiment was conducted to determine whether the 
phenomena demonstrated in Witkin’s large RFT could be shown 
through using a reduced darkened chamber and reduced rod and 
frame. The criterion was the Embedded Figures Test (EFT). The 
N-RFT scores were correlated with the EFT to see whether this re- 
duced RFT was able to differentiate between FI and FD subjects. 

Reduced RFT construction. The main box of the N-RFT was made
of 35" plywood with the back made of two pieces of 14" masonite.
The rotating surface on which the luminescent frame was painted
consisted of a 17” and an 8” circle cut into the masonite. These
circles were glued together and then held in position by wood ...

For the hood, a black plastic material (.006 in. thick), usually used
to cover construction and agricultural goods, was tailored to fit the
box. The mask was from a set of U.S. Army surplus goggles ...
the goggles. Electrician's plastic tape was used to join the ...
and to attach the plastic hood to the box as well as to attach the
hood and goggles.

Two 15 watt light bulbs were fixed in opposite corners to “charge”
the luminescent surfaces between trials.

The luminescent rod was 1” x 8”, the frame 12" square with the
strip 1” wide. The frame was painted directly onto the large rotating ...


Method. The N-RFT was administered following Witkin's (1949)
instructions. However, only one set of eight trials, rather than the
original three sets of eight trials each, was given. According to
Witkin et al. (1963), these extra two sets, which would have in-
volved tipping the subject 28° left and right, would not measurably
affect validity. The N-RFT deviation score for each S consisted
of the sum (over eight trials) of the absolute deviations in degrees
arc of his placement of the rod from the true vertical. Following the
N-RFT, subjects were given the EFT by using Jackson's (1956)
revision of the EFT procedure. Jackson's revisions have shortened
the testing time by half and still have maintained more than .96
reliability.
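
The deviation score defined above is simple to compute. The following sketch assumes each trial is recorded as a signed angle in degrees from true vertical; the function name and the sample trial values are hypothetical.

```python
def deviation_score(rod_settings_deg, true_vertical_deg=0.0):
    """Sum of absolute deviations (degrees arc) of rod settings from vertical."""
    return sum(abs(setting - true_vertical_deg) for setting in rod_settings_deg)


# Eight hypothetical trials for one subject; sign indicates direction of tilt.
trials = [3.0, -5.0, 2.0, -1.0, 4.0, -6.0, 0.0, 2.0]
print(deviation_score(trials))  # 23.0
```

Taking absolute values means that errors to the left and to the right both add to the score rather than cancelling.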

The subjects ranged in age from 10 to 19 years. Of the 38 girls, eight
were in grade school, 10 in high school, and 20 were college freshmen.
Of the 42 boys, eight were in grade school, seven in high school, and
27 were college freshmen. The college subjects received course credit
for their participation.

Results. A test for reliability (internal consistency, odd-even,
Spearman-Brown corrected) yielded r = .92. Concurrent validity coeffi-
cients of the N-RFT with the EFT as criterion produced r = .70
for all 80 Ss (r = .74 for the 38 girls, and r = .50 for the 42
boys). All three coefficients were highly significant (p < .001).
A t-test of boys versus girls performance on the RFT resulted in
t = 3.25 (df = 78, p < .01, two-tailed). The girls’ mean deviation
score of rod placement from true vertical was 40.58 degrees arc, while
the boys’ mean deviation score was 19.07.
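
The Spearman-Brown correction applied to the odd-even reliability can be sketched as below. The report gives only the corrected value (.92); the uncorrected half-test correlation is not stated, so the back-calculation shown is an inference from the standard formula and not a reported result. Function names are illustrative.

```python
def spearman_brown(r_half):
    """Step up a half-test correlation to full-length reliability."""
    return 2 * r_half / (1 + r_half)


def implied_half_test_r(r_full):
    """Invert the correction: half-test r that yields a given full-test r."""
    return r_full / (2 - r_full)


print(round(spearman_brown(0.5), 3))        # 0.667
print(round(implied_half_test_r(0.92), 3))  # 0.852
```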

Discussion. The reliability (internal consistency) of the N-RFT 
compares favorably to that obtained with the full size version 
(Witkin, 1962). The validity of the N-RFT for measuring FI-FD 
was clearly established. While subjects’ mean deviation score on the 
N-RFT was about half that found by Witkin (1949), the N-RFT 
clearly maintains the ability to discriminate FI from FD subjects. 

Some of the subjects were markedly affected during the N-RFT 
session. Typical comments were: “My stomach began to feel upset,” 
and, “When the frame was shifted, it was like going over a hump in 
the road very fast in a car,” or “I had to close my eyes once in a while 
to keep from getting upset!” It was felt that the use of a flexible plas- 
tic hood prevented the subject from obtaining positional cues, as 
would be the case if the subject’s head was immobilized. The ab- 
sence of positional cues may have led to the marked effect of the 
N-RFT on the subjects as well as to the high agreement between the 
EFT and N-RFT scores. The tendency toward nausea under con-
flicting conditions (visual versus gravitational) was reported by
Witkin (1949). The significant difference between boys and girls in
rod placement, with girls having a higher deviation score, also
agrees with findings of Witkin et al. (1962). 

Bercovici (1970) has used the N-RFT to assess 102 subjects’ rela-
tive position in terms of psychological differentiation. The N-RFT
was successful in defining two personality subgroups, FD and FI. The 
FI group was found to be significantly more aggressive. 

It is felt that the results of this validation study indicate that it is
possible to use the rod and frame approach for measuring psychologi-
cal differentiation without the burdensome technical problems pre-
sented by the full size instrument. The gain in convenience, it was
shown, can be obtained with little or no loss of information through 
the use of Nickel’s Portable Rod and Frame Test. 

Summary. Witkin’s Embedded Figures Test (EFT) and Nickel's 
Portable Rod and Frame Test, (N-RFT), a version of Witkin’s Rod 
and Frame Test (RFT) were presented to 80 elementary through 
college age subjects. The N-RFT is completely portable, the size of a
suitcase, and does not require a darkened room. Concurrent validity,
with the EFT as the criterion measure, yielded r = .70. A t-test indi-
cated a significant sex difference beyond the .01 level as would be pre-
dicted from Witkin's findings. It was felt that the N-RFT could be
used in addition to the EFT for research with psychological differenti-
ation.

Gary M. Andrew and Ronald E. Moir. Information-Decision Sys-
tems in Education. Itasca, Illinois: F. E. Peacock, 1970. Pp. xii +
177. $5.75 and $3.95 (paperback)


The stated purpose of this book “. . . is to present the student and
practitioner of education administration with an integration of (a) a
formal structure for decision making and (b) ...
designing an information system to support the decision making func-
tion” (p. vii). While these objectives are seemingly foreign to educa-
tional and psychological measurement, a close examination of the
decision-making function indicates that a key component consists of
a set of objectives which “. . . should be explicit enough to enable one
to measure whether the objective is being realized” (p. 8). In this
context the close relationship between systems analysis and summa- 
tive evaluation seems obvious. Both are concerned with specifying 
educational objectives, determining alternative methods of achieving 
the objectives, and evaluating which of the alternatives are most 
efficient in achieving the specified objectives. 

Since the book is intended to present an overview of a “. . . con- 
tinuum from information to decision that should be recognized when 
managing an educational system" (p. vii) it would seem reasonable 
to assume the book would develop such a continuum. Unfortunately 
this is not the case. The initial discussion consists of a brief expose of 
problem solving and decision making. One of the items in this discus-
sion is a consideration of objectives which includes the following:
“In education it may be argued that this is not true since everyone
knows that the objective of education is 'to provide educated citi-
zens' ” (p. 8). This example is then used to illustrate the need for spe-
cifying a set of criteria to evaluate objectives. It is suggested that the
approach used by Bloom et al. in the Taxonomies would develop more
precisely stated objectives than the procedure detailed in this treatise. 

The applicability of systems analysis to education is next developed.
The major rationale for this applicability is the process of the
aerospace industry in solving “. . . a great number and variety of
unprecedented system design and development problems associated with
national defense” (p. 21). Quite apart from the undenied successes
(and possibly some failures, as witness the continuing eight year
TFX-F111 airplane controversy) there is the basic assumption that national
defense and education are sufficiently similar to be correctly subjected
to the same sort of analysis. The authors argue that systems analysis
needs an objective which is something or things that can be . . .
as being accomplished by the system (p. 22), “what is really required
is an analysis of the whole educational structure before the fact”
(p. 25), and finally “the problem (of defining precise educational
objectives) involves differing value judgments among both educators and
the community on what the most important goals and purposes of
education are . . .” (p. 29). In the context of a pluralistic society these
three statements are, to this reviewer, seemingly incompatible.
Apparently the authors recognize this incompatibility when they say
“. . . we need to know the relationship between what is done in school
and what students learn in school. The systems approach to education
helps to focus clearly on the unknown relationships and gives a clear
indication of the directions educational research should take in the
future, and what answers we should be seeking” (p. 29). This abrupt
degeneration of systems analysis as the means of solving the problems
of education to a procedure for developing guidelines for educational
research is, to say the least, surprising (given the promise of funds to
undertake the research, the leaders in the field of education could
undoubtedly provide the suggested directions for research and at the
same time shed a great deal of light on the authors’ “unknown
relationship”).
Model concepts are next introduced and a sample flow chart (p. 89)
is presented. It is doubtful that the subject of the flow chart would be
of interest to the intended audience. Further, without some detailed
explanation of flow charting techniques the advantages of the procedure
would not be readily apparent to the uninitiated. At the same
time this reviewer would question the necessity of including an appendix
of flow chart symbols which would only be of use to individuals
responsible for developing computer programs.
Fortunately models are briefly considered; attention is next directed
to variables. “A variable in a system is a well-defined attribute
which describes the value or condition of certain aspects of the system.
It is important in systems analysis and model building that the
pertinent variables in a system be described and understood. The
determination of what is pertinent is a highly subjective art form because
there are degrees of effect that various variables have in a system.
. . . thing to remember about variables is that they must
. . . in such a manner that they are completely unambiguous
and have the same meaning to all individuals concerned with the system”
(p. 35). One could only wish that the concept “variable” had
adhered to the last quoted requirement.
One particularly difficult problem, educational objectives, is next
considered and disposed of with “. . . we take the position that
achievement in basic subjects is the most widely accepted and the most
. . . dimension of educational output” (p. 43) and “. . . system
performance can be evaluated by the aggregation of the indi
ment of a comprehensive information collection and dispension system.
One could only wish the consistency which is so admirably evident
in these sections concerned with the information systems had
been applied to the discussions of systems analysis and decision theory.


ROBERT A. SMITH
University of Southern California


Donald J. Brown. Appraisal Procedures in the Secondary Schools.
New Jersey: Prentice Hall, Inc., 1970. Pp. iii + 182. $5.95, $2.95
(paperback).

“Appraisal Procedures in the Secondary Schools” is a title that
suggests either a survey of appraisal procedures presently used in
schools or an instructional book on how to appraise high school
students. Apparently the latter is the principal intention since the author,
in the preface, says, “The book is intended to help teachers, or
prospective teachers, acquire and understand the principles and
procedures that will allow them to do a more effective job in . . .
student achievements.” However, he also states that much of the
content and structure of the book was provided by actual teacher
and student comments from over 400 taped, structured interviews.
Since there is little indication of how many persons expressed
the viewpoints given in the comments, except for occasional use of
“typical,” “many,” or “one teacher,” it is really not a survey of
current procedures.

The use of the comments from the 400 taped interviews can be
considered as illustrative, and perhaps as a readability technique. The
latter possibility is suggested by the putting of appraisal or measurement
techniques in a form easy to swallow, with very little mathematical
and what the author calls “psychometric jargon.” The small
size of the book, some of the chapter headings, the style, and a good
deal of the material suggest an attempt to sugar-coat what students
presumably take to be the bitter pill of educational measurement.


86 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Following an introductory chapter on the organization and the general
methods to be used, the author proceeds with a chapter called,
“The Better I Teach, the Better My Students Will Do on My
Examinations.” The author then gives some methods, not for better teaching,
but for helping students do better on examinations. These methods
include preparing students for examinations, setting up course objectives
that can be tested, giving suggestions to students that will
ease emotional tension, testing students frequently, and carefully
going over examinations that have been taken. One should teach
so that students do better on examinations, and students should be
helped in every way possible to do better on them. Under such conditions
teaching could become almost an examination coaching process.
Better that examinations should be eliminated altogether than
that this should come to pass.

The best of what a student can get from a class probably cannot
be measured by any test, less likely a teacher-made test. All we can
hope for is that a test score is to some extent correlated with the
important factors that cannot be measured. These factors include
such things as interest of the student in the course and in what
it means to his future; the fellowship of teachers and students working
together; skills in locating and using resource persons and materials;
and use of knowledge and skills from the course in conversation and
discussion in and outside of the class. Such things could be deviated
or eliminated through concentration of teaching on the improvement
of scores on the very imperfect classroom measuring instruments
that are likely to be used.

A third chapter is called, “How Do You Know Whether an
Examination is Good”? The question is not answered. For a test to
be good, it is pointed out, it must be reliable and valid, and also a
brief discussion of reliability and validity is undertaken on a very
elementary and non-mathematical level, with some suggestions for
improving reliability and validity.

The chapter on “What Should Classroom Test Measure”? gives a
fairly specific and helpful way of constructing a classroom test.
Construction begins with working out the specific objectives of what
is to be taught and evaluated. From these is made an outline of the
content of the teaching unit, with the amount of time to be spent in
each area. The examination questions are then constructed in these
areas, with the proportionate number of questions in each area
approximating the proportion of time spent in each area. So far so good;
but in the detailed example given, the test content, as shown in the
test plan, is taken directly from the unit outline.
. . . presented . . . is for a two weeks instructional period, at the end
of which the examination is to be given. In two weeks of instruction
with fifty-one minutes per day surely more than two pages of material
will be covered. A student having this outline, and all students
should have it, could memorize everything to be on the test in fairly
. . . good cramming session should easily take care of that.


If an outline is to guide learning, then the learning the student
has been guided into is what is of primary importance, not the guide
itself. The general facts and statements of the outline are a guide to
the more specific statements, demonstrations, and explanations of
basic course content. Deduction as well as induction is an important
part of learning, and the student should be able both to develop a
generality and to support a generality with specific factors and details.
Consequently, in testing it is important to discover whether or
not a student has these specifics with which he can demonstrate and
support the general statement. It is for this reason that all material
in a course is important, if it belongs in the course at all, and the outline
is only a very small bit of the course content.

The two following chapters are concerned with the construction 
of essay and objective type tests. Some good points are given, though 
not everyone would agree that the essay examination is a more valid 
measure of achievement than the objective type, or that one should be 
sure that one is measuring things other than knowledge or specific 
facts with an essay test, since these are best measured with objective 
tests. The importance of proof-reading essay question answers is 
affirmed, but not for objective examinations, for which it is at least 
as important. It often happens on a true-false or multiple-choice item 
that the student inadvertently marks a choice he had not intended or 
misreads a question. As a matter of fact the author himself makes 
such an error in one of his examples, on item construction: 


(T) F 9. Benjamin Franklin lived a long time ago. 
Better: 

(T) F 9. Benjamin Franklin lived in the sixteenth century. 

Even so, there is much that should be helpful to students in these 
chapters, as well as in the following chapter on constructing and 
analyzing an objective test. In the discussion on item analysis, how- 
ever, one might question the use of the term “validity index” for the 
difference in proportion of right answers between the upper and 
lower groups, which ordinarily is referred to as a kind of item
discrimination index. In this chapter he states again, though somewhat
differently, the purpose of testing:

Since the major purpose of a classroom exam is to discriminate
among how much each student learned in a unit of instruction, it is
desirable to have a test that yields a wide range of scores.


Translated, this statement recognizes that a classroom test seeks to 
differentiate students as to their relative amounts of learning, but 
does not measure the actual quantity of learning achieved by each
student. His statement, therefore, is somewhat at odds with the earlier
one that the purpose of the classroom examination is to determine 
how much the students have learned, which implies a scale of a higher 
order than those used in the classroom. 
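The item statistic questioned above, the difference in proportion of right answers between the upper and lower groups, can be sketched in a few lines. This is only an illustration: the function name, the 27 percent grouping fraction, and the response data are assumptions made here and are not taken from the book under review.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-minus-lower difference in the proportion answering an item correctly.

    item_correct : list of 0/1 flags, one per examinee, for a single item
    total_scores : list of total test scores, in the same examinee order
    fraction     : share of examinees placed in each extreme group (assumed 0.27)
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))              # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]         # lowest and highest scorers
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten examinees, total scores and their answers to one item.
totals = [3, 5, 6, 8, 9, 11, 12, 14, 15, 18]
item = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
d = discrimination_index(item, totals)
print(round(d, 3))
```

The index describes how well the item separates high scorers from low scorers on the same test; it carries no information about an outside criterion, which is why "discrimination" is the more usual label.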

The last quoted statement also does not agree with his earlier as- 
sertion that how well a student does on a test is an indication of how 
well the teacher has taught. If there is a wide dispersion of scores, 
then, if the latter is true, the teacher has taught some students well 
and some poorly. In other words the teacher who has constructed 
a good test is revealed as a good teacher for some students and a poor 
teacher for others. If a teacher really wants to know how good or 
bad he is, he better let someone else do the evaluating. 

Then follows a little chapter on statistics, which is entitled: “The
Mystery Hour: A Few Statistics.” It is inexcusable to present statistics
to students in this way. Too often statistics seem to them an
abracadabra pronounced over numbers to make them good or bad.
To foster such a superstition through the use of this chapter title, in
place of attempting to bring about an acceptance and understanding
of the use and value of statistics, is to defeat the cause of improved
measurement. The few statistics that are given are presented so
inadequately that an hour spent on this chapter would truly be a
mystery hour.

In the chapter on assigning grades there is some good discussion, 
but the measurement student who turns to it in the hope of receiving 
much help in assigning grades is likely to be disappointed. Perhaps 
he need not be too much concerned, however, for the author states: 


Although there is some evidence to support it, this chapter's intent 
is not to condemn all grading practices and demand their immedi- 
ate revision, but to indicate that marking procedures are necessarily 
only as good or as bad as the teacher who is trying to apply them.


This statement implies that there are no good or bad ways of grading,
there are only good or bad teachers. If you are a good teacher, use
any method of grading you want, and if you are a bad teacher you
might as well use any method too, since then in any case it will be
bad.

This chapter discusses grading by inspection, using gaps to indicate
grade separation, and grading on a curve. The weaknesses of
these methods are pointed out, and the author seems to prefer the
“modified curve,” in which factors other than the test score are taken
into account, such as homework and class participation, though he
fails to indicate how each of these is to be evaluated separately. The
procedure he suggests for computing a final grade consists of adding
the weighted grades of examinations, papers, reports, and class
recitations; but again, no analysis is made of how each separate grade
might be assigned, except for the methods of test grading, of which
he is highly critical.

In the tenth chapter, “Published Tests: An Evil or a Blessing,”
there is a good discussion of intelligence testing on a very general
and elementary level, giving some important precautions for
interpretation of IQ's. There is no specific help, however, in interpreting
obtained IQ's. No mention is made of the terms “gifted” and “mentally
retarded.” It is emphasized that the interpretation of scores
should always take into account measurement error, but there is no
mention of how it should be taken into account. There is also a
discussion of standardized achievement tests, of student motivation in
taking such tests, and of telling parents and students the results.
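One common way of taking measurement error into account is to report a band around the obtained score based on the standard error of measurement, SEM = SD * sqrt(1 - reliability). The sketch below illustrates that general technique only; the score of 110, the standard deviation of 15, and the reliability of .91 are hypothetical values chosen here, not figures from the book.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx), the usual classical test theory formula."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sd, reliability, z=1.0):
    """Band of +/- z standard errors of measurement around an obtained score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical values: obtained IQ of 110, SD of 15, reliability of .91.
sem = standard_error_of_measurement(15, 0.91)
low, high = score_band(110, sd=15, reliability=0.91)
print(round(sem, 2), round(low, 1), round(high, 1))
```

With these assumed values the SEM is about 4.5 points, so a band of one SEM on either side of the obtained score runs from roughly 105.5 to 114.5.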

In the final chapter, “A Look into the Future,” instructional
television, programmed instruction, computer assisted instruction, . . .
instruction, and tele-lecture are discussed very briefly. The book
concludes with what is apparently the theme of the book, “The key
to better evaluation of student achievement, now as in the future, is
the teacher.” One might add to this that one key to better teaching
is better evaluation of student achievement.


CARL A. CLARK
Chicago State College


Frederick G. Brown. Measurement and Evaluation. Itasca, Illinois:
F. E. Peacock, 1971. Pp. xiv + 198. $3.95 (paperback).


According to its author, this book “was written primarily for [the]
introductory course in educational psychology” and to emphasize
“aspects of measurement and evaluation most pertinent to the classroom
teacher.” It should be evaluated with these purposes in mind.

The introductory chapters are concerned with the functions of . . .
and the basic qualities of measuring instruments: standardization,
objectives, content sampled, directions to insure uniform testing
conditions, scoring, consistency or reliability, and validity. Both
reliability and validity concepts are explained and illustrated quite
adequately considering the level of the intended student audience.
Paradigms contrast graphically test-retest reliability, or the coefficient
of stability, equivalent forms reliability, the coefficient of equivalence,
and the coefficient of stability and equivalence. Similarly,
criterion-related validity is explained and illustrated graphically.
These paradigms would seem excellent teaching devices. Internal
consistency and split-half coefficients, construct validity, and content
validity are briefly and reasonably adequately explained. This
reviewer does not agree that construct validity should be of “limited
concern to most classroom teachers” although the author is probably
justified in saying that it is of limited concern.

The interpretation of test scores in terms of
norm data and types of derived scores is well explained and illustrated.
Commendably critical mention is made of criterion-referenced
measurement. (This reviewer deplores the term criterion-referenced
tests because of probable confusion with criterion-related
validity. The idea of such interpretation of test data is almost as old
as the measurement movement although tests have too seldom been
designed to promote its accomplishment.)

The chapter on classroom tests contains excellent “GUIDELINES 
FOR CONSTRUCTING MULTIPLE-CHOICE ITEMS” and 
“GUIDELINES FOR CONSTRUCTING TRUE-FALSE ITEMS” 
though this reviewer does not believe that teachers should be en- 
couraged to construct the latter. The discussion of matching items is 
most inadequate. There could be much more explanation and illus- 
tration of matching exercises suitable for use with hand-scored, or 
machine scored, answer sheets. There should be some discussion of 
keylist exercises. These are especially useful with reference to quoted 
material in measurement of intellectual skills. 

The “GUIDELINES FOR CONSTRUCTING ESSAY QUES- 
TIONS” and the “COMPARISON OF ESSAY AND OBJECTIVE 
TESTS” are excellent “EXHIBITS.” They are accompanied by 
satisfactory discussion of advantages and limitations. 

The chapter on analyzing test scores and test items, the explanations
of the computation of percentile ranks and standard scores, and
the introduction to elementary descriptive statistics in an appendix are
unusually clear and complete for so small a book.

The later chapters of the text deal successively with examples 
of well-known standardized achievement, general scholastic aptitude,
and vocational aptitude batteries. The kinds of exercises in each of 
the subtests of each of the batteries are illustrated by ample quota- 
tions of their practice exercises. 

The final chapters of the book contain discussions of grading, the
. . . of instruction, and include a summary or “recapitulation.”

Although the book has some minor limitations, it should serve the
purposes for which it was written. It is the opinion of this reviewer that
it would serve, when accompanied by one of the recent books of
readings in the field, as a good choice for an introductory course in
educational measurement.


Max D. ENGELHART 
Duke University 


N. M. Downie and R. W. Heath. Basic Statistical Methods. (3rd
Edition). New York: Harper and Row, 1970. Pp. xi + 356. $9.95.
N. M. Downie. Study Guide . . . Basic Statistical
Methods. (3rd Edition). New York: Harper and Row, 1970. Pp.
125. $2.95 (paperback).


Several years ago when this reviewer began teaching introductory
statistics, a book often recommended was Basic Statistical Methods
by Downie and Heath. Obviously, any text which has now gone 
through three editions must be satisfying the needs of many in- 
structors and students. Basic Statistical Methods (3rd ed.) does have 
its several virtues. The book is well written, easy to read, and makes 
almost no demands upon the mathematical abilities of the student. 
One might also classify this text as a “cookbook,” a rubric not inten- 
tionally opprobrious but one which does indicate that many topics 
are not explored in depth. Problems (and their answers) are pro- 
vided for each chapter. 

Basic Statistical Methods is composed of 18 chapters which are in
turn divided into three parts. “The first nine chapters present descriptive
statistics, and the next seven consist of an introduction to
statistical inference. The third part consists of two unrelated chapters,
first an introduction to test theory and construction, and second, a
look at the more frequently used distribution-free statistical tests”
(p. xi).

Technically, there is little to quibble with in the first part. Some 
topics are discussed which, to this reviewer’s mind, might well have 
been omitted. For example, the computation of square roots and of 
grouped data techniques (even to the extent of giving Charlier's 
check and Sheppard’s correction) seem out of place in a modern 
statistics book. And purists may be bothered by the use of N rather 
than N — 1 in the product-moment correlation formula. Nevertheless, 
this section is readable and competent. Indeed, the first nine chapters 
(encompassing an introduction, a mathematical refresher, frequency 
distributions, averages, variability, standard scores, product-moment 
correlation, other correlational techniques, and linear regression) are 
well suited to accompany an introductory course in measurement. 
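On the matter of N versus N - 1, a short numerical check may be useful. It is a sketch under stated assumptions: the function and the five pairs of scores are invented here, and nothing below reproduces the formula as the text prints it. It shows only the general point that the product-moment coefficient is unchanged so long as the same divisor is used for the covariance and for both standard deviations.

```python
import math


def pearson_r(x, y, ddof):
    """Product-moment r with divisor (n - ddof) used consistently throughout.

    ddof=0 divides by N; ddof=1 divides by N - 1.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - ddof)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - ddof))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - ddof))
    return cov / (sx * sy)


# Hypothetical paired scores.
x = [2, 4, 5, 7, 9]
y = [1, 3, 4, 8, 9]
r_n = pearson_r(x, y, ddof=0)     # N in every term
r_n1 = pearson_r(x, y, ddof=1)    # N - 1 in every term
print(round(r_n, 4), round(r_n1, 4))
```

A discrepancy arises only when divisors are mixed, for example z scores formed with N - 1 standard deviations and then averaged over N.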

The second part is the weakest section of the text. Weaknesses in- 
clude anachronistic advice, inappropriate examples, and errors of in- 
terpretation. The probability chapter is traditional, i.e., similar to 
high school texts circa 1950. A welcome inclusion is the chapter on 
sampling. Unfortunately, this otherwise good chapter is marred by 
its treatment of confidence intervals. With a set of sample data, con- 
fidence is given a direct probabilistic interpretation (p. 164). The 
chapter on hypothesis testing offers a non-standard convention rela- 
tive to significance: when the authors discuss two tailed tests, they 
speak of level of significance; when they present one tailed tests, they 
refer to point of significance. In choosing between t and Z test statistics
in an independent groups situation, the authors advocate using
the criteria of sample size and homogeneity of variance. No explicit
consideration is given to whether the population standard deviation
is known or estimated, although the formulas presented (t and Z), in
effect, contain variance estimators. There are no discernible notational
distinctions between sample variances and variance estima-
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which Kaiser (1963) called the most psychometrically defensible 
approach extant, is an egregious omission. 

The chapter devoted to the number of factors problem also
lacks a modern orientation. The reader is not informed of contemporary
methods of determining the number of factors. The
statistical tests applicable in maximum likelihood solutions are 
not mentioned. This perhaps may be accounted for by the general 
absence of any discussion of maximum likelihood methods. However, 
the omission of the psychometric bounds recommended by Guttman 
(1954) and Kaiser (1960, 1963) is inexcusable. The authors’ rec- 
ommendation not to rotate principal components is intolerable in 
light of the successful application of what has become known as 
the “Little Jiffy” method of factor analysis. 

The Harris-Kaiser (1964) method of oblique transformation is 
not mentioned. The treatment of factor scores is incomplete and 
outdated, with more recent writings and recommendations completely 
ignored (e.g., Harris, 1967). 

Granted, then, that the text is not an introduction to modern
factor analysis, might it assist the reader in understanding the
basic process of factor analysis? The answer is a qualified No. The
authors’ emphasis on geometrical representations is helpful in
understanding the concepts of centroid, factor loading, and orthogonal
and oblique (reference and primary) factors. Thurstone’s
cylinder and box problems are handled nicely and used in one 
enlightening chapter dealing with the meaning of factors (Chapter 
5). In the preface, the authors wrote: “We have written this book 
for the mathematically unsophisticated reader because he has the 
most difficulty in finding out about factor analysis... . We have 
tried to restrict the topics and depth 


knowledge of factor analysis 
the text is intended to be an 


(p. 15), factor matrix (p. 15), referenc f 1
data (p. 43), gross rank (p. 26), d inan, (p. 42), fallible
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ambiguity, tedium and resultant frustration will enable the reader 
to continue with his effort. The frustrated reader can find solace
with the typist of the text who at the end of Chapter 2 “found the
next chapter particularly tedious but was delighted with the meaningfulness
of Chapter 5” (p. 35). One wonders what the typist’s
opinion was of Chapter 4. Guertin and Bailey, in referring to
structure and pattern values in connection with oblique solutions,
add to the confusion already present by adopting their own set of
labels (p. 106), and using such terms as factor loading, factor matrix,
and factor in the broadest sense possible (p. 104). On another
occasion, the authors incorrectly refer to the correlation between a
continuous dependent variable and a dichotomous independent
variable as a biserial r rather than as the correct point-biserial r
(pp. 199-200).
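The distinction between the two coefficients can be made concrete with a small sketch. The data and function names below are hypothetical. The point-biserial r is simply the product-moment correlation with the dichotomy coded 0 and 1; the biserial r additionally assumes that the dichotomy was cut from an underlying normal variable and rescales the point-biserial value by sqrt(pq)/h, where h is the normal ordinate at the cut.

```python
import math
from statistics import NormalDist


def point_biserial(scores, group):
    """Product-moment correlation between scores and a 0/1 dichotomy."""
    n = len(scores)
    ones = [s for s, g in zip(scores, group) if g == 1]
    zeros = [s for s, g in zip(scores, group) if g == 0]
    p = len(ones) / n
    q = 1.0 - p
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)   # N divisor
    diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return diff / sd * math.sqrt(p * q)


def biserial(scores, group):
    """Biserial r: assumes a normal variable underlies the dichotomy."""
    p = sum(group) / len(group)
    q = 1.0 - p
    h = NormalDist().pdf(NormalDist().inv_cdf(p))   # normal ordinate at the cut
    return point_biserial(scores, group) * math.sqrt(p * q) / h


# Hypothetical data: eight scores and a two-category grouping.
scores = [12, 15, 11, 18, 20, 14, 17, 22]
group = [0, 1, 0, 1, 0, 0, 1, 1]
r_pb = point_biserial(scores, group)
r_b = biserial(scores, group)
print(round(r_pb, 3), round(r_b, 3))
```

Because sqrt(pq)/h is always greater than one, the biserial value is always larger in absolute size than the point-biserial value computed from the same data, so the choice of label is not a trivial one.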

Guertin and Bailey’s informal, casual writing style (“You say, 
‘Hooray! I’m sure glad it didn’t come out to .90 because it should be 
measuring something different” p. 225) may appeal to some readers, 
but insult others. Some direct quotes are made without adequate 
references to the author or page number (e.g., p. 222, p. 202). 
Several comparisons are made between various derived solutions 
using a “factor loading code.” The code is incorrectly explained 
on p. 131; in several tables coded information is not clearly repre- 
sented (e.g., p. 150). 

By their own admission, Guertin and Bailey have not provided a 
scholarly reference on factor analysis. One reads on the dust jacket 
that “the book was developed as a text for the senior author's 
classes but it parallels much of what has been explained to col- 
leagues in consultations about how to analyze their data.” Per- 
haps this book is appropriate for the aforementioned colleagues. 
However, novice mathematical thinkers could do no better than 
heed the suggestion of the authors to begin their study of factor 
analysis from a more mathematical text such as Harman (1967). 
The nonmathematical beginner might more profitably begin with 
Thurstone (1947), or Cattell (1952, p. 1-108), then turn to Har- 
man (1967). To understand modern methods of factor analysis, 
the reader must turn to references such as those listed below. A
good introduction to modern factor analysis has yet to appear.
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statistical model which will reflect the research question originally 
asked. The text begins with the presentation of a structural model 
to account for human behavior according to a three level classification
scheme of independent or predictor variables: (1) within person;
(2) focal stimuli characteristics; and (3) context characteristics
(i.e., environment). It is then suggested that these variables
can be arranged in a multiple regression equation to represent
the hypothesized functional relationship in much the same manner
economists utilize least squares procedures to specify theories.
Chapter II provides a review of inferential statistics with a focus
on the concepts of variance and error sums of squares. Chapter III
presents a brief discussion of the representation of variables within
a vector framework. It is of interest to note that sets of vectors
are utilized to describe the design of interest while most treatments
of least squares analysis use a matrix to specify the experimental
design of interest. This chapter also presents, in a
non-mathematical fashion, some basic concepts of vector algebra. The
fourth chapter compares the regression analysis and the analysis of
variance algorithms for testing stated research hypotheses.

In the fifth chapter, the reader is introduced to use of the com- 
puter program "LINEAR" for obtaining answers to the questions 
posed by the researcher. This program is derived from the PERSUB 
computer subroutine system developed by Joe Ward, Jr. at the 
Personnel Research Laboratory, Lackland Air Force Base, Texas. 
Since Joe Ward and Bob Bottenberg were responsible for the de- 
velopment and dissemination of this particular approach to data 
analysis as well as the program reproduced in the text, it would 
have been appropriate to at least acknowledge the lineage of the 
program. It should also be noted that the same approach to data 
analysis can be followed with any multiple regression computer 
program. The major attractions of LINEAR are the provisions for
data manipulation and for the tests of hypotheses via comparison
of full and reduced models.

Roughly one third of the text is contained in Chapter VI where
the reader is quickly introduced to the use of regression analysis to
solve research questions of a fairly complex nature. Methods are
presented to test effects of: (1) independent variables in categorical
and/or continuous forms; (2) interaction and non-linear
forms of curvilinearity; and (3) higher degree polynomials.
The topics in Chapter VI are treated in both intuitive and formal
presentations. The intuitive approach indicates what is the . . .
of arbitrarily selecting partial regression weights while the formal
presentation gives the mathematical basis for solving the regression
equation developed to answer the research question.

Chapter VII presents an extension of the regression
approach to problems where statistical adjustment through the use
of co-variables is required. Also included is a section on the relationship
of multiple regression analysis to other multivariate procedures
such as discriminant analysis, factor analysis and canonical
correlational analysis. Somewhat surprising is the lack of
discussion concerning the relationship between multiple regression
procedures and the more general multivariate analysis of variance
approach to data analysis.

The last chapter (VIII) entitled “Special Considerations Regarding
Multiple Linear Regression Analysis” presents a potpourri of
recommendations and comments concerning the use of multiple
regression programs such as LINEAR. Included are topics related
to handling missing data, accuracy of iterative solutions, simulating
an analysis of variance via regression analysis, repeated measures,
and data transformation. Also included, as an afterthought
it appears, are algebraic proofs relating equivalence of different tests
of significance and R² and variance of predicted scores.

According to the authors, the text was written for prospective 
and practicing researchers in the behavioral sciences. Since it is
likely that prospective readers may have differing levels of skills 
in the areas covered in the text, the extensive use of descriptive 
chapter subheadings would provide a person with an opportunity 
to pick and choose among topics to be studied. The availability of 
problem sets with answers would tend to suggest this book could be 
used in a self study format. Somewhat surprising is the lack of 
statistical tables. While LINEAR does provide a calculation of the 
approximate probability for the obtained F statistic, it is likely 
that some prospective users of this text may choose not to imple- 
ment LINEAR, particularly in view of the accuracy level provided 
by the iterative process of the program (see pp. 25-254). By using 
a local regression analysis computer program, a researcher can
easily calculate an F ratio by hand and refer to an F distribution 
table to determine the significance of the obtained F value. 
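
The hand calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming the usual general linear model test in which a full model is compared with a restricted model through their squared multiple correlations; the function name `f_ratio` and its argument names are hypothetical and do not come from LINEAR or from the text under review.

```python
def f_ratio(r2_full, r2_restricted, n_obs, k_full, k_restricted):
    """F ratio for testing a restricted model against a full model.

    k_full and k_restricted are the numbers of linearly independent
    predictor vectors (counting the unit vector) in each model.
    All names here are illustrative assumptions.
    """
    df1 = k_full - k_restricted      # numerator degrees of freedom
    df2 = n_obs - k_full             # denominator degrees of freedom
    if df1 <= 0 or df2 <= 0:
        raise ValueError("degrees of freedom must be positive")
    f = ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2


# Hypothetical values: R-squared of .50 (full) and .40 (restricted), N = 103.
f, df1, df2 = f_ratio(0.50, 0.40, n_obs=103, k_full=3, k_restricted=2)
print(round(f, 2), df1, df2)
```

The resulting F, with its two degrees of freedom, is what would be looked up in an F distribution table.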

This reviewer has used the approach of Kelly et al., for intro- 
ducing students in a first year experimental design course to the 
concept of general linear models. Since most of these students have 
been exposed to classical research design procedures prior to this 
course, they have no trouble seeing the isomorphism between the 
analysis of variance and regression approaches for simple one way 
and factorial designs with two levels per factor. However, the
students appear to have a great deal of trouble in seeing the
equivalence of the two procedures when the attempt is made to extend
these ideas to more complex situations. There is, of course, the 
hope that professionals trained in classical methods of data analy- 
sis will attempt to acquaint themselves with the general linear 
hypothesis approach to data analysis. 

Of greater concern to this reviewer is the apparent mistaken 



qualified statements can lead the relatively 


The use of regression analysis with higher factorial 
cal designs with unequal sample sizes also requires 
When treatment group sizes are unequal 
model effects into the regression equation 
sums of squares. Also, the algorithm utilized
may result in different conclusions. That is, some 
will report only the SS remaining after a 
accounted for and some programs will provide sums 
corrected for the unbalance in the design. 

In summary, this author feels that a source describing
linear hypothesis approach to data analysis be available
to researchers with a behavi or social
However, the enthusiastic approach provided by Kelly and his
colleagues appears to only point out the good aspects
of the problems associated with regression analysis. Joe
Earl Jennings are presently at work on a
with the same methods in data analysis; it would seem that a person 
who wants the last word on this subject should await this upcoming 
text before making a choice as to which single text on regression 
analysis he should have in his library. 

Howard B. Lyman. Test Scores and What They Mean. (2nd ed.)
Englewood Cliffs, N. J.: Prentice-Hall, 1971. Pp. viii + 200.
$6.95 and $4.95 (paperback).


The second edition of Test Scores and What They Mean is
unique in many ways. Like the first edition, it differs from the
typical book devoted to educational and psychological testing
in a number of important respects. Primary among these is that
it concentrates heavily on test scores and their meaning, including


50 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


only minor amounts of information about such topics as test con- 
struction, scoring, and administration. Although a variety of stan- 
dardized tests—particularly those in the area of achievement and 
aptitude—are mentioned for illustrative purposes, the author avoids 
detailed descriptions of such instruments. In short, the title is 
reasonably descriptive. 

Another noteworthy aspect of the book is that it is designed to 
assist practicing professionals who are not knowledgeable about 
testing, for example, school teachers, social workers, admissions 
counselors, and even psychiatrists and pediatricians. In addition, 
the author believes that the book is suitable for use in connection 
with college level courses in educational psychology, educational 
and psychological testing, and guidance. 

The first two of its 12 chapters introduce the reader to modern 
testing practices and associated vocabulary. Following these is a
highly abbreviated chapter devoted to test validity and reliability,
and another concerning statistical methodology, the emphasis
here being primarily on descriptive statistics. With the exception
of the reliability material, these four chapters differ little from those
found in the first edition.

The fifth chapter is a discussion of the test manual and is one of 
two new chapters in the second edition. The heart of the book is 
represented by the two chapters which follow, namely, one concern- 
ing derived scores and another concerning profiles. The former is 
easily the longest chapter in the book and the most technical. Var- 
ious types of derived scores are classified in a rather elaborate man- 
ner and systematically described. This is no doubt one of the most 
complete chapters of its kind. In contrast, the chapter concerning 
profiles is shorter and replete with illustrations taken from common 
standardized tests.

The next three chapters are comparatively short, one consisting 
of only two pages. This arrangement is unusual, to say the least, 
since the theme of the three is embodied in the title of the first, 
namely, “Common Sense.” The usual list of “do’s and don’ts” of 
test score interpretation and the communication of such scores are 
listed, and the “don'ts” are often illustrated. These three chapters 
along with the summary chapter at the end of the book regularly 
give the reader the feeling that the author is painstakingly elabo- 
rating the obvious. On the other hand, repeated mention of common 
precautions to be taken in test score interpretation may well be in 
order for the kind of audience for which this book is intended. 

Chapter Eleven is the second of the two new chapters added to 
the second edition. Eight criticisms of psychological testing are
listed and the author’s reaction to each is provided. Almost all of 
these bear directly upon test score interpretation and consequently 
this chapter is a vital addition to the book. Unfortunately, the 
scores used in Chapter Six,

It is easy to understand why this volume would be comparatively 
popular with practicing professionals repe e: lithe of ве 
formal instruction in psychological te
quite low in almost all instances. Tels approach ы у
“cookbook”; rarely does he attempt to treat а is отч
style is informal, if not folksy. Certainly no other is
is laced so heavily with the use of the first person singular. 

Considering the audience to which this book is directed 
nature of its emphasis, its relatively short length (200 pages
distinct blessing. Nevertheless, it is extremely difficult to
recommending that the author strengthen his presentation 
number of additions here and there. For instance, 
why the discussion of test validity is so limited and 
cally cross-referred with other related sections of the 
the use of annotated bibliographies at the end of each 
would definitely improve the volume in that it 
easy route for an eager reader to acquire deeper 
of particular interest to Finally, there 
oversights such as the failure to give proper attention to 
referenced tests, ipsative scores, many of the test 
ited by Buros, and de мин E role played by computers 
in test score reporting i 

In summary, the book is a tidy and simple presentation of test
scores and their interpretation, and in many ways is well designed
for the uninitiated. All others will probably find little in the book
to attract them, unless it is the chapter devoted to derived scores 
and the conversion table for them found in the appendix. 

George G. Stern. People in Context. New York: John Wiley, 1970.
Pp. xxvi + 402. $13.95.

We live in a world of organizations in which individuals find
meaning and expression in their lives chiefly through the ways in
which they identify with organizations and the ways in which
those organizations make possible the realization of individual
aspirations. A particularly crucial relationship between individual
and organization is that between a student and his college or
university. It is this relationship that George Stern explores in great
detail in his book, People in Context. The nature of this relationship,
according to Stern, is a dynamic one which requires the development
of a technique which recognizes, as did Kurt Lewin,
that person and environment be represented in common terms as
complementary parts of an interaction.

The means that Stern and his collaborators at the University of 
Syracuse have chosen to explore this dynamic relationship are 
based on the Need-Press model developed by Henry A. Murray. 
The Need-Press model is a taxonomy of psychogenic needs which 
are related to environmental characteristics which match these 
psychogenic needs. Needs are identified as “characteristic spontaneous
behaviors manifested by individuals in their life transactions.”
Press, as the complement of needs, is made up of “characteristic
behaviors manifested by aggregates of individuals in their
mutual interpersonal transactions.” In the situations Dr. Stern
investigates, the psychogenic needs are those of college students and
the environmental characteristics are those of colleges and univer- 
sities. The interaction of needs and press may produce congruence, 
growing out of favorable, compatible environmental circumstances 
matching and enhancing personal needs; or forms of dissonance 
which result from a poor match of personal needs and environmen- 
tal characteristics. If it is possible to implement the model by de- 
veloping effective measures of need and press and to demonstrate 
the ecological dynamics of the interactions, it may then be possible 
to influence both individual adjustment and collegiate environment 
so as to bring about productive educational experiences, and those
tasks are what Dr. Stern is undertaking in this book. 

Much of the book is devoted to an account of the development of 
the measuring instruments chosen by Stern and his colleagues to
develop his ideas. The basis of these instruments, of course, is the 
Murray need catalog consisting of some 30 needs. The measuring 
instrument developed is the Activities Index (AI) derived from its 
original prototype first developed at Chicago in the early 1950's. 
The instrument used in the research reported in this book consists 
of 300 items distributed, 10 items each, over 30 scales based on the 
Murray needs. The environmental counterpart of the personal needs 
consists of a series of Environmental Indexes. The first and most 
important of these indexes and the one most fully reported in this 
research is the College Characteristics Index (CCI). The other
Environmental Indexes are the High School Characteristics Index
(HSCI), the Evening College Characteristics Index (ECCI), and the
Organizational Characteristics Index (OCI). This review will refer
subsequently only to the College Characteristics Index (CCI). Both
Indexes (AI and CCI) are self-administered questionnaires. The
reporting of the development and validation of these instruments
is extensive in this book. The Indexes themselves are printed in
their entirety in one of the appendices.

The bulk of the book is taken up with the reporting of the
characteristics of various college populations and various college and
university environmental settings as measured by the Indexes.
Extensive study of the characteristics of the Indexes and factor
analyses of the Indexes are also reported. The factors derived are used
to bring about deeper understanding of collegiate behavior. A set
of first-order factors extracted from the AI and CCI scales produced
some 12 factors for AI and 11 for the CCI. A second-order factor
analysis of those factors demonstrated three major dimensions
among the student personality factors. These three dimensions are
(1) Achievement Orientation which includes factors Self-Assertion,
Audacity-Timidity, Intellectual Interests, Motivation, Applied
Interests; (2) Dependency Needs with factors Applied Interests,
Constraint-Expressiveness negative, Diffidence-Egoism negative,
Orderliness, Submissiveness, Timidity-Audacity negative, and
Closeness; (3) Emotional Expression, containing factors Closeness,
Sensuousness, Friendliness, Expressiveness-Constraint,
Egoism-Diffidence, and Self-Assertion. A fourth dimension, Educability,
also appears but is of less magnitude than the other three.

The second-order factor analysis of environmental factors
produced two major dimensions: (1) Intellectual Climate containing
the factors Work-Play negative, Nonvocational Climate negative,
Aspiration Level, Intellectual Climate, Student Dignity, Academic
Climate, Academic Achievement, Self-Expression; and (2)
Non-intellectual Climate with factors Self-Expression, Group Life,
Academic Organization, Social Form, Play-Work, Vocational Climate.

The data on which the analyses of the instruments (AI and
CCI) were based and which were used to answer the major questions
to which the research was addressed were taken from the responses
made by students to the two Indexes at a large number of colleges
and universities throughout the United States. The major questions
the research sought to answer were: (1) What are the major
psychometric properties of the two instruments? (2) Can the factor
scores be used to classify schools and student bodies? (3) Are the
measures of personality and institutional press related to -— 
tional objectives and their achievement? (4) How do measures 
environmental press for an institution as a whole relate to those z 
subcultures within the institution? (5) How is ай 
tween personal needs and environmental press best expressed 
quantified? The answers to these questions are explored 
and reported in the book. The nature of the pattern of needs and
press revealed by the instruments and extended analysis of data 
produced by them has already been referred to above. The char- 
acteristics of special classes of schools and student bodies are elab- 
orated in considerable detail and a reader interested in these phe- 
nomena has a rich collection of data to examine. There are, for ex- 
ample, clear and interesting differences among all of the basic classes 
of institutions investigated: Independent Liberal Arts, Denomina- 
tional Liberal Arts, University Affiliated Liberal Arts, Business 
Administration, Engineering and Teacher Training. The differ- 
ences are quite striking, for example, between the Independent, 
Denominational, and University Liberal Arts Colleges. The Inde- 
pendent Liberal Arts are markedly higher on Intellectual Climate 
and markedly lower than the others on Non-Intellectual Climate. 
On personality factors there are also quite striking differences be- 
tween male and female students. The difference is greatest for 
Achievement Orientation, but is also marked for a number of fac- 
tors in the dimensions Dependency Needs and Emotional Expres- 
sion. As may be expected, a vast number of comparisons between 
various classes of schools and student categories can be made when 
one considers the large number of students and institutions from 
which data were gathered in this study. For the academic reader in 
particular these comparisons are fascinating and illuminating. This 
review of course can do no more than hint at the studies and com- 
parisons presented in the book. In addition to the studies already 
briefly described, there are descriptions and analyses of denomina- 
tional colleges and universities, of three particular institutions, of 
the variability among schools within a large university (Syracuse) 
and of various college climates. 

Since this book, in its stated purposes and in the means it de- 
scribes for achieving those purposes, seeks to present to its readers 
in the education profession some fundamental understanding of 
colleges and universities and some ways of making those institu- 
tions more effective, it is fair to judge it on its contribution to those 
goals. Technically the work presented in this book is a most sub- 
stantial achievement. The work that went into the development of 
these scales and their use in the measurement and description of 
collegiate behavior and settings is little short of prodigious. The 
design of the studies and the work involved in carrying them out 
is technically competent at the highest level. The book is well docu- 
mented and referenced at all points and the professional scholar in 
the field will find it extremely useful. For the researcher who may 
wish to use some of the instruments and techniques in his own insti- 
tution the enormous amount of data and technique presented make it 
a marvelously useful book. 


Conceptually it seems to this reviewer that the book has 
limitations. The techniques used are effective for 
scribing institutions and The device of using a single
of concepts based on a
scribe the relationship between person 
effective. But the technique results in a static 
students and schools and the results of their interaction, 
tion of both students and institutions are limited to 
lated scores on instruments on which the responses 
dents are confined to yes and no answers to rather 
questions. Even though this is a useful method for many 
it has qualities of barrenness when it is the only method used 
describing and understanding the dynamics of human behavior 
it occurs in such fascinating settings as colleges and universities. 
The descriptions and conclusions in the book, though brilliant 
and fascinating, rest solely on the accumulation and 
of psychometric data. The report is almost wholly unrelieved by 
the reporting of incidents and cases which illustrate the uses of this 


functioning. There is only one case study reported—one, 
tally, which demonstrates rather convincingly the 
techniques to predict and understand the behavior of an 
student. More such studies would have helped 
strate the usefulness of the work. Some accounts 
actions of various groups in various settings might 
illuminating. The techniques used by Stern, even though 
fully done, as is the case here, have their limitations 
comes to understanding the dynamics of human behavior. 

There are some important uses to which these ideas 
ments may be put. There has always been a rather
this country that students and colleges are basically 
alike. Though this idea has been often suspect, there 
little in the way of appropriate conceptual schemes and 
measuring instruments to give substance to the su 
provide a real means to think and talk about such differences. 
been particularly true of faculties that after being trained in 
intellectual colleges and in the rarified atmosphere 
schools they have been blind to the particular qualities of 
students or have assumed them to be inferior versions of 
classmates or the students they taught while graduate assistants. 
For educational reasons it is important that these faculties а 
to understand their students and the psychological climates 
which they operate. The Stern instruments and the Need-Press 
conceptual scheme offer a superb way for them to do so.

A FACTOR ANALYTIC INTERPRETATION STRATEGY

The purpose of this paper is to illustrate the use of a strategy
for determining the common factors in a set of data. C. Harris
(1967) suggested using several different computing algorithms for
the initial solution, obtaining derived solutions, both orthogonal and 
oblique, comparing the results, and regarding as the important sub- 
stantive findings those factors that are robust with respect to 
method. This paper illustrates a way of comparing the results. 

The factor results used for this illustration of a factor analytic
interpretation strategy are the reanalyses, by seven different
solutions, of the data from nine of the Guilford studies as reported by
C. Harris (1967). The initial component and factor methods used
are Incomplete Principal Component (Hotelling, 1933), Alpha
(Kaiser and Caffrey, 1965), a Jöreskog method (1963, 1967), and
Harris R-S² (1962). The Jöreskog method used for Matrices 08
and 23 is his Unrestricted Maximum Likelihood Factor Analysis
(UMLFA) procedure (1967) using a critical value of .05;
Jöreskog's early procedure (1963) was used for the other seven matrices.
These four methods provide a component solution (Incomplete
Principal Component), a factor solution with a statistical basis
(Jöreskog, 1963 or UMLFA), and two factor solutions with a
psychometric basis: one for a minimum number of factors (Alpha) and
one for a maximum number of factors (Harris R-S²). It may be
noted that these three factor methods are scale-free. Derived
orthogonal solutions were obtained for each of the four initial solutions
using the Kaiser normal varimax² procedure (1958) and derived
oblique solutions were obtained for the first three initial solutions
using the Harris-Kaiser independent cluster solution (1964). An
oblique solution was not obtained for the Harris R-S² method since
it would have certain correspondences to the oblique solution
obtained from the Jöreskog (1963) method.
The nine Guilford matrices that were reanalyzed are: 


08 Creative thinking 

09 Evaluative abilities 

12 Planning 

14 General reasoning 

16 Reasoning, creativity, and evaluation 
(Subdivided into three—16A, 16B, and 16C) 

22 Problem-solving abilities 

23 Cognition and convergent production 




common factors аге utiliz 
trated in this paper; thus, 
than .30 (absolute) on one 
tables. Note that Guilford 
critical value in interpretin 


= 
Fries ee bn p (1969) have compared varimax rotations with 
be К ful Е eri s и rejected the former as not 
unsatisfactory since these do not терг. s the Sie mould find ишсе» 

TABLE 1
Numbers of Initial and Derived Common Factors for the Various Methods

                                          Initial   Common    Common
Matrix  Factor Method                     Factors   Factors   Factors

08      Incomplete Principal Component      14        13        14
        Alpha                               14        11        13
        UMLFA                               19        10        14
        Harris R-S²                         28        10
09      Incomplete Principal Component      15        12        14
        Alpha                               15        18        13
        Jöreskog                             a
        Harris R-S²                         39        11
12      Incomplete Principal Component      13        12        13
        Alpha                               13        10        12
        Jöreskog                             7         7         7
        Harris R-S²                         30         7
14      Incomplete Principal Component       6         6         6
        Alpha                                6         6         6
        Jöreskog                             4         4         4
        Harris R-S²                         13         7
16A     Incomplete Principal Component       6         5         6
        Alpha                                6         5         5
        Jöreskog                             4         4         4
        Harris R-S²                         16         7
16B     Incomplete Principal Component       6         6         5
        Alpha                                6         6         5
        Jöreskog                             4         4         4
        Harris R-S²                         14         7
16C     Incomplete Principal Component       6         6         6
        Alpha                                6         6         6
        Jöreskog                             6         5         6
        Harris R-S²                         14         6
22      Incomplete Principal Component      12        11        11
        Alpha
        Jöreskog                             Y         T         7
        Harris R-S²                         24         8
23      Incomplete Principal Component       5         5         5
        Alpha                                5         5         b
        UMLFA                                5         5
        Harris R-S²                         17         $

a Went to p − 1 factors.
b Did not converge.

The procedure involves attempting to find the common factors
(components) that are similar over solutions. This was done by 
starting with a derived orthogonal component from the Incom- 
plete Principal Component initial method. The reason for this 
choice is that this solution tends to include more variables with co- 
efficients greater than .30 on a particular component than any of 
the other solutions. Then for each other derived orthogonal solu- 
tion and for each derived oblique solution, a common factor was 
searched for that seemed to be similar to the component selected, 
particularly with respect to the large coefficients. 
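
The search just described was carried out by inspection. As a rough illustration only, it can be approximated by counting how many relevant variables (coefficients above .30 in absolute value) a candidate factor shares with the selected component. The overlap count is an assumed stand-in for the authors' judgment of similarity, and the names `relevant_variables` and `best_match` are hypothetical.

```python
def relevant_variables(loadings, cutoff=0.30):
    """Indices of variables whose absolute coefficient exceeds the cutoff."""
    return {i for i, c in enumerate(loadings) if abs(c) > cutoff}


def best_match(target, solution, cutoff=0.30):
    """Index of the factor in `solution` sharing the most relevant
    variables with `target`, or None when no factor shares any.

    `target` is one list of coefficients (one per variable); `solution`
    is a list of such lists, one per factor. Overlap counting is an
    assumption made for this sketch, not the authors' stated rule.
    """
    target_set = relevant_variables(target, cutoff)
    best_index, best_overlap = None, 0
    for j, factor in enumerate(solution):
        overlap = len(target_set & relevant_variables(factor, cutoff))
        if overlap > best_overlap:
            best_index, best_overlap = j, overlap
    return best_index


# Hypothetical coefficients for four variables.
component = [0.60, 0.50, 0.10, 0.00]
other_solution = [[0.10, 0.20, 0.70, 0.40], [0.55, -0.45, 0.05, 0.31]]
print(best_match(component, other_solution))
```

Ties are resolved in favor of the first factor encountered, which is one more simplification a careful analyst would want to review by eye.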

The next step involved determining those factors (components) 
that are robust with respect to method—factors which tend to in- 
clude the same variables across methods. A variable was considered 
relevant to a factor if it had a coefficient greater than .30 (abso- 
lute) on that factor. A comparable common factor (CCF) was de- 
fined as one having two or more of the same relevant variables on 
at least five of the seven derived factors (components). This means 
that a comparable common factor is defined by more than two dif- 
ferent initial solutions and by both orthogonal and oblique rota- 
tions. Thus, no one initial method can account for a variable’s re- 
jection and no one derived method can account for a variable’s 
acceptance on a comparable common factor. Note that for the two 
matrices for which one of the initial solutions was not available, 
Matrix 09 and Matrix 22, a comparable common factor is defined
as one having two or more of the same relevant variables on at
least four of the five solutions.
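
The counting rule can be sketched in a few lines. The sketch assumes that each variable is counted separately, so that a variable is treated as robust when it is relevant on at least the required number of matched factors; the text does not say whether the same set of solutions must carry every robust variable, so this reading is an assumption. The same counts also yield the comparable specific and noncomparable categories that the strategy distinguishes. The name `classify_factor` is hypothetical.

```python
from collections import Counter


def classify_factor(relevant_sets, min_solutions=5):
    """Classify one group of matched factors across derived solutions.

    `relevant_sets` holds, for each derived solution, the set of
    variable numbers relevant on the matched factor (an empty set when
    no factor was matched). Returns a label and the robust variables.
    """
    counts = Counter(v for s in relevant_sets for v in set(s))
    robust = sorted(v for v, n in counts.items() if n >= min_solutions)
    if len(robust) >= 2:
        label = "CCF"   # comparable common factor
    elif len(robust) == 1:
        label = "CSF"   # comparable specific factor
    else:
        label = "NCF"   # noncomparable factor
    return label, robust


# Hypothetical relevant-variable sets for seven derived solutions.
matched = [{24, 25, 13}, {24, 25}, {24, 25}, {24, 25, 44},
           {24, 25}, {24}, {25, 44}]
print(classify_factor(matched))
```

For the two matrices with only five solutions, the same function would be called with `min_solutions=4`.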

Two other types of factors may be found. A comparable specific
factor (CSF) is defined as one having only one (the same) relevant
variable on at least five of the solutions. A noncomparable
factor (NCF) is defined as one not having any one or more of the
same relevant variables on at least five of the solutions.
Table 2 gives the number of comparable common factors,
comparable specific factors, and noncomparable factors for each of
the nine matrices. The number of common factors obtained by
Guilford for each matrix is also given in Table 2.
Matrix 23 was chosen to illustrate the fairly
3 The application of the interpretation strat in thi
nine of the Guilford studies can be found in ow ca кое oe 

TABLE 2
Number of Factors for Each Matrix

                      Reanalyses                      Guilford
          Comparable  Comparable
          Common      Specific    Noncomparable       Common
Matrix    Factors     Factors     Factors             Factors

08          10           0            8                 15
09          10           1           10                 M
12           7           2            4                 14
14           6           0            1                  9
16A          4           1            5                 11
16B          5           0            Б]                 9
16C          5           0            5                 10
22           7           0            7                 13
23           5           0            1                 13

close agreement across methods that can be secured among various 
factor solutions. Matrix 08 was chosen as a matrix for which the 
various factor solutions are in least agreement. Of the nine ma- 
trices studied, the results for 08 and 09 seemed to be the most dis- 
crepant across the seven derived solutions. Of these two, Matrix 
08 was chosen for presentation here because one initial factor 
method was not available for Matrix 09. For 08 the various solu- 
tions agree in part but for some of the factors the results are quite 
diverse. Table 3 contains the results for Matrix 23 and Table 4 
the results for Matrix 08. The relevant variables are in capital
letters and the nonrelevant variables (noise?) are in small letters. 
The order of the factors in the tables is arbitrary within each of the 
three types of factors (CCFs, CSFs, and NCFs). Guilford’s re- 
sults are presented in each table with the factors of the reanalyses 
with which they seem to agree most closely. 

For Matrix 23 the factors are rather robust over solutions. There 
are five comparable common factors for the 30 variables in this 
matrix and one noncomparable factor. This is in contrast to the 13 
common factors obtained by Guilford. 

As shown in Table 4, the results for Matrix 08 are not as robust 
over solutions as they are for Matrix 23; the results from the vari- 
ous solutions are comparable (in the sense defined for this strat- 
egy) for some factors but not for others. It should be pointed out 
here that for both 10 and 12 factors the UMLFA method yielded 
an improper solution since the unique variance for variable num- 

TABLE 4
Factor Results for Matrix 08*


Reanalyses 
Orthogonal Oblique Guilford 


COMPARABLE COMMON D
FACTOR 1 
35 PUNCHED HOLES 567 49 50 52 48 37 34 |45 
48 PRACTICALJUDGMENT 60 47 38 46 68 56 37 |32 
51 MECHANICAL PRIN- 
CIPLES 80 71 378 69 80 78 80 |54 
52 ARITHMETIC REASON- 
ING 46 44 5p) 49" 359 95 38 
16 Match Problems 41 38 34 
34 Word Matrices 31 
COMPARABLE COMMON o F 
* FACTOR 2 
36 MUTILATED WORDS 40 36 38 33 50 35 


37 STREET GESTALT 


COMPLETION 70" 62 e БӨЛ E OEE 7 44 
38 PERCEPTUAL SPEED 64 58 54 57 65 47 Ў 
41 UNUSUAL DETAILS 34 33 33 31 3 
42 PENETRATION OF 
CAMOUFLAGE 76 6 68 67 80 72 55 |45 40 
»"47 SPATIAL ORIENTATION dr 
(PART I) 60 54° 50 53 50 44 
35 Punched Holes 32 


* Decimals have been omitted. 


Key to Factor Solutions of Reanalyses: 
I Incomplete Principal Component 
II Alpha 
Ш UMLFA 
& ІУ Harris R-S? 


Key to Guilford Factors: 
D Visualization 
C Perceptual Speed 
F Closure 
TABLE 4 (Continued)


COMPARABLE COMMON 
FACTOR 3 


49 NUMERICAL OPERA- 
TIONS (PART I) 
50 NUMERICAL OPERA- 
TIONS (PART II) 
52 E eras REASON- 
NG 


1 Sentence Analysis 
44 Ship Destination 
47 Spatial Orientation (Part I) 
COMPARABLE COMMON 
FACTOR 4 


1 SENTENCE ANALYSIS 39 
2 PARAGRAPH ANALYSIS 49 
27 SENTENCE GESTALT 
(OMISSIONS) бу 
33 SENTENCE SYNTHESIS 65 
43 VOCABULARY 71 
67 
42 




46 INFERENCE TEST 

53 SENTENCE GESTALT 

11 Number Associations 
(Uncommonness) 

M Circle Square I 

15 Circle Square II 

17 Sign Changes 

18 Implied Uses 

21 Associations IT 

28 Word Transformation 

32 Concept Synthesis 32 37 

34 Word Matrices 35 

44 Ship Destination 36 

51 Mechanical Principles 

52 Arithmetic Reasoning 34 39 


Key to Guilford Factors:
B Numerical Facility 
A Verbal Comprehension 
E General Reasoning 
TABLE 4 (Continued)


COMPARABLE COMMON 
FACTOR 5 


24 APPARATUS TEST 69 59 60 61 10 67 60 
25 SOCIAL INSTITUTIONS 

(DIRECT) 80 67 375 6 0 и з 
13 Consequences (Remote) 37 31 
22 Unusual Uses 32 
41 Unusual Details 31 
44 Ship Destination 31 34 
COMPARABLE COMMON 

FACTOR 6 


28 WORD TRANSFORMA- 

TION 7-2 59 52 
40 DISARRANGED WORDS 72 54 57 
53 SENTENCE GESTALT 55 49 48 
11 Number Associations 

(Uncommonness) 
14 Circle Square I 
15 Circle Square II 38 31 
27 Sentence Gestalt 

(Omissions) 
36 Mutilated Words 31 
39 Controlled Associations 31 32 


COMPARABLE COMMON
FACTOR 7 




16 MATCH PROBLEMS 44 32 32 41 43 45 
45 SYMBOL MANIPULA- 
TION 62 40 44 6 49 36 
17 Sign Changes 
23 F-Test —36 
35 Punched Holes 33 
F Perceptual ке 35 32 35 45 
Disarranged Wo 
43 Vocabulary =31 -87 —4 


53 Sentence Gestalt 


53 Sentence бна. E 


Key to Guilford Factors: 
N Sensitivity to Problems G Word Fluency — 
Н Associational Fluency K Adaptive Flexibility 
TABLE 4 (Continued)


Orthogonal Oblique 
I JY. IE. IV, I ID HI 


COMPARABLE COMMON 
FACTOR 8 
29 GESTALT TRANSFOR- 
MATION 6001 50 48 38 56 39 B$ 
30 PICTURE GESTALT от 48 45 52 79 69 46 
19 Quick Responses 
(Uncommonness) —87 
23 F-Test —36 
31 Object Synthesis 
41 Unusual Details 35 32 
48 Practical Judgment 
COMPARABLE COMMON 
FACTOR 9 
5 IMPOSSIBILITIES 50 35 44 45 39 
6 PLOT TITLES (LOW 
QUALITY) 70 58 BE ва TT 
8 COMMON SITUATIONS 75 50 50 69 68 
9 BRICK USES (FLUENCY) 74 48 49 72 66 
12 CONSEQUENCES TEST 
(LOW QUALITY) DC 65 86 80 
1 Sentence Analysis 38 32 
3 Figure Analysis 43 35 
4 Figure Concepts 
(Uncommonness) 33 
13 Consequences Test 
(Remoteness) 36 
22 Unusual Uses 44 
24 Apparatus Test 35 
31 Object Synthesis 34 
39 Controlled Associations 52 
41 Unusual Details 32 
Key to the Guilford Factors: 
M Redefinition 


I Ideational Fluency è 
TABLE 4 (Continued)


COMPARABLE COMMON 
FACTOR 10 


10 BRICK USES (FLEXIBI- 
LITY) 55 


18 IMPLIED USES 52 

22 UNUSUAL USES 39 
39 CONTROLLED ASSOCIA- 

TIONS 41 

1 Sentence Analysis 31 


3 Figure Analysis 
4 Figure Concepts 
(Uncommonness) 
5 Impossibilities 
6 Plot Titles (Low Quality) 
7 Plot Titles (Cleverness) 
8 Common Situations 
9 Brick Uses (Fluency) 
11 Number Associations 
(Uncommonness) 
13 Consequences Test 
(Remoteness) 
19 Quick Responses 
(Uncommonness) 
20 Associations I 
23 F-Test 
24 Apparatus Test 
25 Social Institutions (Direct) 
31 Object Synthesis 
32 Concept Synthesis 34 
34 Word Matrices 44 
37 Street Gestalt Completion 
38 Perceptual Speed 
41 Unusual Details 


Key to the Guilford Factors: 
J Originality 
L Spontaneous Flexibility 


Reanalyses 
Orthogonal 

52 50 53 
38 34 33 
т 69 63 
50 47 
35 39 35 
т ^4 А 
53 51 55 
46 53 41 

34 
44 50 42 
57 08 54 
62 63. 49 
DL! 1148. 144 
62 65 65 
3з 32 34 
45 43 38 

35 
3з 40 

35 
32 39 

35 
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| КЕП! 
„15413.11 н ШЫ | 
ҮШ 


NONCOMPARABLE 


11 Number Associations 
12 Consequences Test 
,19 Quick Responses 


NONCOMPARABLE 
FACTOR 18 


FACTOR 16 
17 Sign Changes 47 о ми 
23 P-Teat -и 
29 Gestalt Transformation їз 
31 Object Synthesis -% -о ~64 
32 Concept Synthesis 35 
36 Mutilated Words 39 Ld 
NONCOMPARABLE 

FACTOR 17 
4 Figure Concepts 

(Uncommonness) 31 и 


(Uncommonness) 42 “ 
(Low Quality) -35 


(Uncommonness) u 
Associations I 64 Ч 
32 


segs 


ber 27, Sentence Gestalt (Omissions), was equal to or less than .02.
Jöreskog suggests partialling out any variables that have a unique
variance that is essentially zero (≤ .02). It was decided
to remove this variable from the intercorrelation matrix. The solu-
tion given here for UMLFA is for 15 factors for 52 variables, with
variable number 27 omitted. There are 10 comparable common 
factors for the 53 variables in Matrix 08 and eight noncomparable 
factors. Guilford obtained 15 common factors for this set of data. 
The results of the application of our factor analytic row pe 
tion strategy to the remaining seven matrices are summarized 
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in this paper. As mentioned earlier the seven derived solutions 
seemed to be very similar for Matrix 23. They are most discrepant 
for Matrices 08 and 09. The results seem to be fairly similar for 
Matrices 14 and 16B. For Matrices 12, 16A, 16C, and 22 there is
some close agreement and some diversity. The comparable com- 
mon factors of Matrices 09 and 22 seem to have relatively few rel- 
evant variables. 

In general, the number of comparable common factors is similar 
to the smallest number of common factors in the derived solutions 
of the reanalyses. For one matrix (09) the number of CCFs is one
less than the smallest number of common factors obtained for any 
one derived solution. For six of the matrices (08, 12, 16A, 16C, 22, 
and 23) the number of CCFs is equal to the smallest number of 
common factors obtained for any one or more derived solutions. 
The number of CCFs is greater than the smallest number of com- 
mon factors for a single derived solution for two of the matrices 
(14 and 16B). 

The number of comparable common factors for the data in any 
one of the matrices is always considerably fewer than the number 
of common factors obtained by Guilford. In general, a few of the 
CCFs agree rather closely with common factors obtained by Guil- 


TABLE 5
Intercorrelations of Oblique Factors for Matrix 23


Common Factor 1 2 3 4 
TOT Tdi o SS —_ 
п 05 
ш 54 
3-1 28 48 
п 35 60 
ш 69 70 
41 25 29 
п 37 39 E 
ш 58 55 БЕ 
5-1 23 52 2 
1 
ш = % 66 53 
ы ы O Me ss ao 
* Decimals omitted.
Key to initial methods:
I Incomplete Principal Component
II Alpha
III UMLFA
ford. In many instances two or more of his common factors co-
alesce into one comparable common factor.

For all of the initial methods, the derived oblique solutions tend 
to drop variables with small coefficients from the common factors. 
Thus, more variables would be relevant to a comparable common
factor, but with small coefficients, if only derived orthogonal solu-
tions were used. Two good examples of this can be seen in CCF 3
of Matrix 23 (Table 3) and CCF 4 of Matrix 08 (Table 4).

The intercorrelations of the oblique factors are given, by initial 
method, in Table 5 for Matrix 23 and in Table 6 for Matrix 08. 


TABLE 6
Intercorrelations of Oblique Factors for Matrix 08*


Comparable
Common Factor 1 2 3 4 5 6 7 8 9
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I Incomplete Principal Component 
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These are included as an illustration of the possible comparability 
in some cases and diversity in other cases of the correlations of the 
derived oblique factors from the various initial methods that are 
included on the same CCF.

A strategy for determining comparable common factors in a
given set of data has been illustrated. For one matrix, consid- 
erable agreement among the several derived solutions studied was 
demonstrated, but these results did not reproduce the ones secured 
initially by Guilford. For Matrix 23 this study offers a possi- 
ble interpretation that does not support in detail Guilford's Struc- 
ture of Intellect Model. For the other matrix, there was only a 
limited consistency among the derived solutions. The domain of 
creative thinking as defined by Matrix 08 appears to be unclear. 

For future studies we would recommend obtaining both derived 
orthogonal and derived oblique solutions for each of these initial 
factor methods: Alpha, Harris R-S², and Unrestricted Maximum
Likelihood Factor Analysis. A comparable common factor could 
then be defined as one having two or more of the same relevant vari- 
ables on at least four of the six derived factors. 
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A COMPARATIVE STUDY OF SOME SELECTED 
METHODS OF PATTERN ANALYSIS


LOUIS L. McQUITTY 


University of Miami 
Coral Gables, Florida 


This paper compares selected methods of pattern analysis by
the author and generates an improved method which incorporates
many desirable features of the several methods and simultaneously
eliminates a number of undesirable features of the other methods.


Iterative, Intercolumnar Correlational Analysis 


Iterative, Intercolumnar Correlational Analysis was developed
out of a theory of types (McQuitty and Clark, 1968). A type is
defined as a category of objects of such a nature that every object
in the category possesses a common and unique combination of
characteristics; every object in the category possesses all of these
characteristics, and no object not in the category possesses all of
the characteristics; a nonmember may possess some but not all of
the characteristics.

Prior to the development of the iterative method many other
methods for the isolation of types, as defined above, had been de-
veloped by the author. All of these methods, including the iterative
method, start with a matrix of interassociations between objects.
Every object is assessed in terms of selected characteristics and the
relation of every object with every other object is recorded in terms
of a numerical index to yield a matrix of interassociations between
objects.
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The matrix of interassociations is analyzed in a fashion designed 
to classify the objects into categories which fulfill the above defini- 
tion of types. The methods of analysis, prior to the iterative one, 
begin by classifying objects into many categories at the bottom level 
of classification; each category contains only a few objects and the 
members of each category have relatively many characteristics in 
common. As the analysis proceeds objects are classified into larger 
and larger categories at successively higher and higher levels with 
the members of the categories agreeing on fewer and fewer charac- 
teristics. The consolidation at each successive level is realized, for 
the most part, by combining categories of the next lower level. 
Errors made at a lower level can be carried to a higher level. 

In most of the methods, the classificatory decisions are based on 
the highest entry in each column in the matrix of interassociations. 
This procedure neglects the many other entries in a matrix, which
might be helpful in increasing the validity of the decisions. 

In addition to the general purpose of creating an improved typo- 
logical method, Iterative, Intercolumnar Correlational Analysis was 
directed to two specific purposes, viz., (a) to increase the validity of 
classification decisions, and (b) to utilize all of the entries of the 
matrix in making these decisions. It has, however, yet another 
unique characteristic. It classifies objects into a hierarchical system 
from top down rather than bottom up. The method divides the orig- 
inal matrix into two submatrices, each submatrix into two addi- 
tional submatrices, and thus the process continues until at the bot-
tom level every object is usually separated as a single object.


The Method 


In order to apply the method, a matrix of interassociations be- 
tween objects is required. The first step is to compute the correla- 
tion between the corresponding entries of any two columns, i and j. 
This index is a measure of the extent to which two objects vary 
jointly in their correlations with the other objects of the matrix. 
It is called the first intercolumnar correlation between Objects i 
and j. The first intercolumnar correlation is computed for every 


object with every other object to yield the first intercolumnar cor- 
relation matrix. 


The second intercolumnar correlation between Objects i and j is
obtained by computing the correlation between the corresponding
entries of Objects i and j of the first intercolumnar correlation ma-
trix. It is computed for every object with every other object to pro-
duce the second intercolumnar correlation matrix.

As the process of generating new intercolumnar correlation ma-
trices proceeds, a matrix is usually obtained which contains only
correlations of +1 and −1. There are usually two sets of +1's.
Each set of +1's defines a submatrix. The −1's mediate between
objects of the two submatrices.

Table 1 reports a matrix of original associations to which Itera- 
tive, Intercolumnar Correlational Analysis was applied. The first, 
third, and fifth intercolumnar correlation matrices are reported in 
Tables 2, 3, and 4 respectively. Column 1 of Table 4 shows that 
one submatrix is composed of Object 1 and all of the other objects 
having a correlation of +1 with it. Object 2, which has a correla-
tion of −1 with Object 1, defines the other submatrix; it is com-
posed of Object 2 and all of the other objects which have a correla-
tion of +1 with it. Each of the objects of one submatrix has a
correlation of −1 with each of the objects of the other submatrix.

In continuing the analysis, the above procedures are applied to 
the submatrices. For this purpose, the +1’s of the submatrices are 
replaced by the corresponding entries from the original large matrix. 
Every submatrix is generally divided in the same fashion as just 
outlined for the original matrix. The steps are continued until the 
analysis is completed. 
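
The bifurcation step described above can be sketched in code. What follows is a minimal illustration and not the author's program: the function names, the convergence tolerance, the iteration cap, and the small four-object matrix are assumptions introduced here, and the diagonal entries are included when two columns are correlated, a detail the text does not specify.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # A constant column would give a zero denominator; not handled in this sketch.
    return sxy / sqrt(sxx * syy)


def intercolumnar(matrix):
    """Correlate every column of a square matrix with every other column."""
    n = len(matrix)
    cols = [[matrix[r][c] for r in range(n)] for c in range(n)]
    return [[pearson(cols[i], cols[j]) for j in range(n)] for i in range(n)]


def bifurcate(matrix, tol=1e-6, max_iter=200):
    """Iterate until all entries are (nearly) +1 or -1, then split on Object 0."""
    n = len(matrix)
    current = matrix
    for _ in range(max_iter):
        current = intercolumnar(current)
        if all(abs(abs(v) - 1.0) < tol for row in current for v in row):
            break
    with_first = [j for j in range(n) if current[0][j] > 0]
    against_first = [j for j in range(n) if current[0][j] <= 0]
    return with_first, against_first


# Hypothetical associations: Objects 0-1 resemble each other, as do 2-3.
example = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
first, second = bifurcate(example)
print(first, second)
```

To continue the analysis as the text describes, each returned group would be used to extract the corresponding submatrix of the original associations, and `bifurcate` would be applied to that submatrix in turn.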


Some Weaknesses and Improvements 


Two of the possible weaknesses of the above method are that it 
does use all of the data in a matrix or submatrix and it classifies 
from the top down. Any error made in any bifurcation is not cor- 
rected in the further analysis. 

Some indices are smaller than other indices and are, therefore,
less reliable. An alternative approach is to develop a method de-
signed to use only the more reliable indices.

One approach would be to limit the computation of intercolumnar 
correlations to the higher indices of every matrix or submatrix 
being analyzed. This approach has the disadvantage of shortening 
the range and thereby tending to lower the reliability of the inter- 
columnar correlations. 
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TABLE 4 
Fifth and Final Iteration from Table 1 
Note. Data for this table taken from McQuitty and Clark, 1968.


Multiple Linkage Analysis

A method which uses only the more reliable indices has recently
been developed by the author (McQuitty, 1969). It is an extension 
of Elementary Linkage Analysis and is called Multiple Linkage 
Analysis. 

This method uses the minimal number of the larger indices of 
association required to accomplish two goals: (1) to isolate two 
central cores of highly interrelated objects in every matrix or sub- 
matrix, and (2) to associate every object with a central core. By 
this approach the method avoids the long “chains” of association 
sometimes obtained with Elementary Linkage Analysis, wherein 
Object z is associated with a central core of objects through a 
link, first with s, which is linked with T, which is linked with ¢, 
which is associated with a core. 

This new method of Multiple Linkage Analysis and the iterative 
method were applied to a common set of data. The data were chosen 
as crucial for testing the methods; the data contain many ties and 
near ties, and some methods of analysis will generate additional ties 
in the course of the analysis. The data are those already reported in 
Table 1. 

The results from the two methods are shown in Figures 1 and
2. One of the points of disagreement occurs at the lower left hand
portion of the two figures. The iterative method, Figure 1, classifies
Objects 3, 6, and 9 into a triad which is then joined by a pair 13 and
16. The new method, on the other hand, as can be seen in Figure 2,
first classifies Objects 3, 6, and 16 (rather than 9) into a triad which
is then joined by Object 9.

Figure 1. A Hierarchical Classification of the Objects by Iterative,
Intercolumnar Correlational Analysis.

Figure 2. A Hierarchical Classification of the Objects by Multiple Linkage
Analysis. (Reprinted from McQuitty, 1969.)

The differences in classification are based on small numerical dif- 
ferences in the data in relation to unique features of the methods 
of analysis; many of the differences in results of this kind are within 
chance errors. 


A Revision of Rank Order Typal Analysis 


A method is needed which portrays options of the above kind in 
classifying objects. It should show the indices on which the options 
are based, and it should resolve all conflicts at higher levels of classi- 
fication. 

Another new method by the author has merit for this purpose. It 
is a revision of Rank Order Typal Analysis and is called Relaxed 
Rank Order Typal Analysis (McQuitty, 1971). Both the original
and the revised methods derive from a definition of types. A type is 
a category of objects of such a nature that every object of the cate- 
gory is more like every other object of the category than it is like 
any object of any other category. This strict definition of a type 
precludes the finding of any but a few small types in most sets of 
empirical, psychological data. The revised method relaxes the defi- 
nition sufficiently to isolate types in most sets of data. 

An added feature of relaxing the definition is that the method gives 
alternative classifications for certain objects which are highly similar 
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(within error limits) and reports resolutions to the alternatives at 
higher levels of classification. 

The results from Relaxed Rank Order Typal Analysis for the 
same data as above are reported in Figure 3. The figure portrays 
two conflicting triads of Objects 3, 6, and 9 versus Objects 3, 16, and 
9. This example is one of the areas of conflict reflected by the other 
two methods in Figures 1 and 2. The conflict is resolved in Figure 
3 by combining the four objects of the two triads into a single tetrad 
at the next higher level. 

Figure 3 resolves other conflicts between Figures 1 and 2. Figure 
3 shows, for example, that Object 2 has two separate classifications,
one corresponding with its classification in Figure 1 and the other
corresponding with its classification in Figure 2. Figure 3 shows 
the level at which these two alternative classifications are resolved. 


A Method Which Adjusts the Classification Criterion 
to the Requirements of the Data 
Figure 3 does not, however, resolve all of the conflicts between 
Figures 1 and 2. A method is needed whereby it is easy to adjust the 
criterion for classification to the requirements of the data. 


Figure 3. A Hierarchical Classification by Relaxed Rank Order Typal
Analysis.

A revision of an older method has merit in relation to the present
problems, viz., Hierarchical Classification by Reciprocal Pairs. A
revision of it was developed to include an analysis of ties, where i
is highest with j and j is highest with i, and i is highest also with k
and k is highest with i, as illustrated in Table 5 (McQuitty, 1966;
McQuitty, Price, and Clark, 1967). The problem of ties is solved
by classifying i with j and i with k.


TABLE 5
A Tie in Reciprocal Pairs


Note. A dash marks the highest entry in a column.

That this method works effectively with data containing ties 
is illustrated in Figure 4, which reports the results from applying it 
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to the same data as was analyzed by the other methods. Because of 
ties, Object 3 is classified separately with Objects 6 and 16; Object 
9 is classified separately with Categories 3-6 and 3-16, and at the 
next higher level of classification, i.e., the third level, all of these con-
flicts are resolved because Objects 3, 6, 9, and 16 enter a common
tetrad. In addition to its classification with Categories 3-6 and 3-16, 
Object 9 is also classified separately with Object 13. This latter con- 
flict is resolved at the fourth level of classification. This method re- 
solves problems, as just illustrated, in a fashion similar to Relaxed 
Rank Order Typal Analysis. 

An advantage of the reciprocal pairs method is that it portrays
clearly all of the indices on which the classification is based, in-
cluding the tied values, and reveals to the investigator the multiple
classifications and usually the ways in which they are resolved; the
objects in conflict usually enter a common category, sooner or later.

The definition of a reciprocal pair can be relaxed so that a pair
can be accepted as reciprocal if i is either highest or second highest
with j and j is either highest or second highest with i. This approach
portrays more options similar to those shown in Figure 4.
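
The strict and relaxed definitions of a reciprocal pair can be compared in a short sketch. This is an illustration under stated assumptions, not the published procedure: the function names, the `level` parameter, and the four-object matrix are hypothetical, and tied entries are given the same (highest) rank.

```python
def rank_in_column(assoc, column, obj):
    """Rank of `obj` among the entries of `column` (1 = highest).

    The rank is one plus the number of other objects whose association
    with `column` is strictly larger, so tied entries share a rank.
    The diagonal entry of the column is ignored.
    """
    n = len(assoc)
    return 1 + sum(
        1 for k in range(n)
        if k != column and assoc[k][column] > assoc[obj][column]
    )


def reciprocal_pairs(assoc, level=1):
    """Pairs (i, j) in which each object ranks within `level` for the other."""
    n = len(assoc)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rank_in_column(assoc, i, j) <= level
        and rank_in_column(assoc, j, i) <= level
    ]


# Hypothetical symmetric associations among four objects.
assoc = [
    [0, 9, 5, 1],
    [9, 0, 6, 2],
    [5, 6, 0, 8],
    [1, 2, 8, 0],
]
print(reciprocal_pairs(assoc, level=1))  # strict definition
print(reciprocal_pairs(assoc, level=2))  # highest or second highest
```

With these made-up values the strict definition yields the pairs (0, 1) and (2, 3); relaxing to the second highest entry adds the pair (1, 2), which is the kind of additional option the relaxed definition is meant to expose.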

The relaxed definition was applied to the set of data already re- 
ported. A portion of the results are shown in Figure 5. At the 
bottom level of classification, three options are reported for each of 
the Objects 2, 12, 10, and 20. All of them are resolved at the second 
and third levels of classification. 

Almost all data are subject to chance errors. Strict methods of pat-
tern analysis can yield false classifications based on small differences. 
A desirable feature of Hierarchical Classification by Reciprocal 
Pairs is that its requirements can be relaxed so that optional classi- 
fications will appear to the investigator rather than being hidden by 
resolutions which may be based on chance errors. This method 
usually resolves the options at a higher level of classification. 

In addition to using the highest and second highest values as
illustrated here, reciprocity can be extended to include third, fourth,
etc., highest values. Too many alternative classifications can,
however, be generated; confusion then results.

One of the reasons why Hierarchical Classification by Relaxed
Reciprocal Pairs can lead rapidly into confusion is that the level of
relaxation is spread throughout the data. If appropriate relaxation
is to be realized and confusion avoided, the level of relaxation must
be adapted to the differential requirements within the data as the
analysis proceeds.

Figure 5. A Hierarchical Classification by Relaxed Reciprocal Pairs.


Hierarchical Classification by Multi-Level Reciprocity 
A method of relaxed reciprocity which adapts differentially to 
the needs within data and thereby avoids confusion, and at the same 
time solves the other problems posed in this paper, has just recently 
been developed by the author (McQuitty, 1970). 


Unique Features 

Unique features of this method are (1) a specification of the level
of reciprocity used at every stage of the analysis, (2) a gradual
increase in the level of reciprocity (more relaxed classification) as
required by the characteristics of the data, (3) an adjustment of the
level of reciprocity to both the validity of the data and the size of the
categories into which objects are classified, and (4) an ability to
reject an object for classification because it is inappropriate to the
categories generated by the other objects.
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The novel feature of this method is illustrated with the data of 
Table 1. The data of Table 1 are converted to ranks within columns 
and are reported as rank orders in Table 6. Object 3, for example, is 
reported in Column 1 to be second most like Object 1. 

In the case of tied values, the highest rank (smallest numerical
value) is assigned to all of the tied values. In this approach, scores
of 30, 29, 29, 29, and 28 would have ranks of 1, 2, 2, 2, and 5 rather
than 1, 3, 3, 3, and 5 (as is the usual practice).
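
The two tie conventions just contrasted can be checked with a few lines of code. The function names are hypothetical; the scores are the ones quoted in the text.

```python
def highest_rank_for_ties(scores):
    """Rank 1 = largest score; tied scores all receive the highest (smallest) rank."""
    return [1 + sum(1 for other in scores if other > s) for s in scores]


def mid_rank_for_ties(scores):
    """The usual practice: tied scores receive the mean of the ranks they span."""
    ranks = []
    for s in scores:
        greater = sum(1 for other in scores if other > s)
        equal = sum(1 for other in scores if other == s)
        ranks.append(1 + greater + (equal - 1) / 2)
    return ranks


scores = [30, 29, 29, 29, 28]
print(highest_rank_for_ties(scores))
print(mid_rank_for_ties(scores))
```

The first call reproduces the ranks 1, 2, 2, 2, 5 used by the method, and the second reproduces the conventional 1, 3, 3, 3, 5.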

Table 6 reports also the level of reciprocity of every object with
every other object up to and including a level of 19. For example, the
second value in both Row 1, Column 3 and Column 1, Row 3 is
eight. It is the reciprocity level for the pair of objects 1-3. It is the
larger number of the two ranks, a rank of two for Object 3 in
Column 1 versus a rank of eight for Object 1 in Column 3. The rec-
iprocity level is, therefore, eight for these two objects.

Stated in general terms, the level of reciprocity is the value of the 
larger number of the two ranks mediating between two objects as 
reported in their respective columns. 
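
The definition can be made concrete with a short sketch that converts a matrix of associations to within-column ranks and then takes the larger of the two ranks for every pair. The function names and the four-object matrix are assumptions made for illustration; ties receive the highest rank, as described above.

```python
def column_ranks(assoc):
    """ranks[row][col] = rank of object `row` within column `col` (1 = highest).

    Diagonal entries are ignored and tied values share the highest rank.
    """
    n = len(assoc)
    ranks = [[0] * n for _ in range(n)]
    for col in range(n):
        for row in range(n):
            if row == col:
                continue
            ranks[row][col] = 1 + sum(
                1 for k in range(n)
                if k != col and assoc[k][col] > assoc[row][col]
            )
    return ranks


def reciprocity_levels(assoc):
    """levels[i][j] = larger of (rank of j in column i, rank of i in column j)."""
    n = len(assoc)
    ranks = column_ranks(assoc)
    return [
        [0 if i == j else max(ranks[j][i], ranks[i][j]) for j in range(n)]
        for i in range(n)
    ]


# The pair discussed in the text: ranks of two and eight give a level of eight.
print(max(2, 8))

# Hypothetical associations among four objects.
assoc = [
    [0, 9, 5, 1],
    [9, 0, 6, 2],
    [5, 6, 0, 8],
    [1, 2, 8, 0],
]
for row in reciprocity_levels(assoc):
    print(row)
```

Taking the maximum means a pair counts as close only when each object ranks the other highly, which is why a single low rank on either side raises the level.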


Two Versions 


The reciprocity level is used as the classification criterion. It can 
be applied in two slightly different ways: (1) successive linkages, 
or (2) core attachments. 

Successive linkages. In successive linkages, one starts with a cri-
terion of one. Any two objects which have a reciprocity of one
between them constitute a beginning category. Every other object
which has a one with either of the two members of the category is
added to the category.

Other pairs of objects with a criterion of one between them are
sought and built up in the same fashion as above, using a criterion of
one.

After all such pairs have been exhausted and each has been built
to its maximum under a criterion of one, the criterion is increased by
units of one and each category built to its maximum under each size
of the criterion. The analysis is complete when every object has been
assigned in terms of its smallest reciprocity value.
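
A rough sketch of successive linkages follows. It assumes that a matrix of reciprocity levels is already available and treats the procedure as a linkage in which one qualifying link is enough to join a category; the merging of whole categories at larger criteria, the stopping rule, the function name, and the example levels are assumptions of this sketch rather than details taken from the text.

```python
def successive_linkages(levels):
    """Return {criterion: categories} as the criterion grows by units of one.

    `levels` is a symmetric matrix of reciprocity levels (diagonal ignored).
    An object joins a category when its level with at least one member
    does not exceed the current criterion.
    """
    n = len(levels)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative of x's category.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    history = {}
    max_level = max(levels[i][j] for i in range(n) for j in range(n) if i != j)
    for criterion in range(1, max_level + 1):
        for i in range(n):
            for j in range(i + 1, n):
                if levels[i][j] <= criterion:
                    parent[find(i)] = find(j)
        groups = {}
        for obj in range(n):
            groups.setdefault(find(obj), []).append(obj)
        categories = sorted(g for g in groups.values() if len(g) > 1)
        history[criterion] = categories
        if len(categories) == 1 and len(categories[0]) == n:
            break  # every object has been assigned
    return history


# Hypothetical reciprocity levels for four objects.
levels = [
    [0, 1, 3, 3],
    [1, 0, 2, 3],
    [3, 2, 0, 1],
    [3, 3, 1, 0],
]
print(successive_linkages(levels))
```

With these made-up levels, a criterion of one forms the categories {0, 1} and {2, 3}, and a criterion of two joins them through the link between Objects 1 and 2.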

Core attachments. In core attachments, the cores are defined as
categories generated at any level of the classification criterion; the
most basic approach is to restrict them to the largest categories
which can be realized by a criterion of one.

The core approach differs from successive linkages by the fact
that any object brought into the core must satisfy the criterion
with respect to every object in the core, not just some one of them.
If the initial pair is composed of Objects i and j and we are using a
criterion of one, then Object z qualifies to join the pair if and only
if it has a reciprocity level of one with each of i and j. Likewise, Object
y then qualifies if and only if it has a reciprocity of one with each of i,
j, and z. By contrast, successive linkages requires entry objects to
satisfy the criterion with at least one object of the category.
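
The stricter entry rule for cores can be sketched as follows. This is a simplified, order-dependent illustration and not the published algorithm: cores are grown greedily in object order, which need not give the largest possible cores, and the function name and example levels are hypothetical.

```python
def build_cores(levels, criterion=1):
    """Grow cores in which every pair of members satisfies the criterion.

    `levels` is a symmetric matrix of reciprocity levels. A candidate is
    admitted only if its level with EVERY current member of the core is
    at most `criterion`.
    """
    n = len(levels)
    assigned = set()
    cores = []
    for i in range(n):
        for j in range(i + 1, n):
            if i in assigned or j in assigned or levels[i][j] > criterion:
                continue
            core = [i, j]
            for x in range(n):
                if x in core or x in assigned:
                    continue
                if all(levels[x][m] <= criterion for m in core):
                    core.append(x)
            assigned.update(core)
            cores.append(sorted(core))
    return cores


# Hypothetical levels: 0-1 and 1-2 are at level one, but 0-2 is at level two.
levels = [
    [0, 1, 2, 3],
    [1, 0, 1, 3],
    [2, 1, 0, 3],
    [3, 3, 3, 0],
]
print(build_cores(levels, criterion=1))
print(build_cores(levels, criterion=2))
```

Under a criterion of one, Object 2 is refused entry to the core {0, 1} because its level with Object 0 is two, although a single-link rule would have admitted it through Object 1. Raising the criterion to two admits it.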

After core categories have been isolated and built to their maxi- 
mum sizes, all other objects are assigned to them using the method of 
successive linkages. 

In either approach, multiple classification due to near ties can be 
introduced by relaxing the classification criterion. 

Results 

Results with the data of Table 1 are shown in Figure 6 for the
successive linkage version and in Figure 7 for the core assignment
version. With the present data they give identical results in terms of
the categories into which the objects are classified.


Figure 6. Hierarchical Classification by Successive Linkages. (Reprinted
from McQuitty, 1970.) Vertical axis: reciprocity levels; horizontal axis:
object code numbers.


Figure 7. Hierarchical Classification by Core Assignments. (Reprinted from
McQuitty, 1970.) Horizontal axis: object code numbers. Legend: === Association
is so remote that classification is questionable.


Desirable Features 


Hierarchical Classification by Multi-Level Reciprocity, as just
outlined, has many desirable features, including: (1) setting the
classification criteria at successively higher and higher levels as
required by the data, (2) classifying from bottom up (a small
initial criterion) or top down (a large initial criterion), (3) yielding
the same results irrespective of the starting point, (4) analyzing
fairly large matrices by pencil and paper, (5) portraying all classi-
fication decisions and the objective basis for them, (6) reporting
the internal consistency of the results, (7) excluding “misfits,” and
(8) assigning objects either to central cores exclusively or to central
cores and their extensions as illustrated here.


Summary 


The application of selected methods of pattern analysis to a par-
ticularly difficult set of data illustrates strengths and weaknesses of
the methods and serves as a basis for the development of a new
method which combines several desirable features and eliminates
certain undesirable features.
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A MEASURE OF THE AVERAGE INTERCORRELATION 


EDWARD E. CURETON 
University of Tennessee 


Kaiser (1968) gives a formula for the average intercorrelation
based on the largest eigenvalue of the correlation matrix. It is

$$\bar{r} = \frac{E - 1}{p - 1},$$


where E is the largest eigenvalue and p is the number of variables. It 
occurred to the writer that a simple approximation to this formula 
might be based on the first centroid of the correlation matrix. 

Kaiser considers a correlation matrix R(p by p), with unities on 
the diagonal and equal values r elsewhere. Solving for r in terms 
of p and the largest eigenvalue of R, he obtains the formula given 
above. 

The column sums of R will each be 1 + (p − 1)r, and the total
sum will be p times this value or p[1 + (p − 1)r]. Corresponding
to the largest eigenvalue, the sum of squares of the first centroid
factor loadings (each column sum divided by the square root of the
total sum) will be

$$\sum f^2 = \frac{p\,[1 + (p - 1)r]^2}{p\,[1 + (p - 1)r]} = 1 + (p - 1)r,$$

so that

$$r = \frac{\sum f^2 - 1}{p - 1},$$

which is precisely Kaiser's formula with E replaced by Σf².
Using Hotelling's classical example,




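2.043 1.729 1.797 1.767 7.336

(the four column sums of the correlation matrix followed by their
total), we find that

$$\sum f^2 = \frac{\sum_j \bigl(\sum_k r_{jk}\bigr)^2}{\sum_j \sum_k r_{jk}} = \frac{13.515}{7.336} = 1.842,$$

from which r = .281. Kaiser finds for this same problem that r = .282,
and the still simpler arithmetic mean of the off-diagonal entries is .278.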
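
The arithmetic can be verified with a short sketch. The function names are hypothetical, and the fourth column sum is taken as 1.767, the value implied by the printed total of 7.336; the second check uses a made-up matrix with all intercorrelations equal to .30, for which the approximation should return exactly that value.

```python
def column_sums(R):
    """Column sums of a square correlation matrix with unities on the diagonal."""
    return [sum(row[j] for row in R) for j in range(len(R))]


def centroid_average_r(sums):
    """Average intercorrelation from the first-centroid sum of squared loadings.

    Sum of squared centroid loadings = (sum of squared column sums) / (total sum);
    this value replaces the largest eigenvalue E in (E - 1) / (p - 1).
    """
    p = len(sums)
    sum_f2 = sum(s * s for s in sums) / sum(sums)
    return (sum_f2 - 1) / (p - 1)


# Column sums quoted for Hotelling's example (fourth value as noted above).
hotelling_sums = [2.043, 1.729, 1.797, 1.767]
print(round(centroid_average_r(hotelling_sums), 3))

# Hypothetical 5-variable matrix with every off-diagonal entry equal to .30.
p, r = 5, 0.30
equal_R = [[1.0 if i == j else r for j in range(p)] for i in range(p)]
print(centroid_average_r(column_sums(equal_R)))
```

Only column sums are needed, so the approximation avoids extracting an eigenvalue.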
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SELF-CLAIMED AND TESTED KNOWLEDGE! 


RALPH F. BERDIE 


Student Life Studies 
University of Minnesota 


The traditional method for observing whether or not a person
knows something is to develop a test or examination which provides 
an opportunity for him to demonstrate his knowledge. The person is 
asked to indicate, through recall or recognition, the answer to a ques- 
tion and the person asking the question then decides regarding the 
correctness or appropriateness of the answer. The decision regarding 
the person’s knowledge is a function of what he actually knows, 
the way the question is asked, and the way the judgment is made 
concerning the correctness of the answer. The assumption is made 
that if a question is asked of a person and he answers the question 
in the proper way, he has the information, and if he does not answer 
the question properly, he does not have the information. 

An alternative method to determine whether or not a person 
knows something is to ask him if he knows it. Thus one can present 
an individual with a list of laws, principles, persons, or facts and 
ask the person to check or otherwise indicate the ones with which 
he is familiar and the degree of his familiarity. 

Many people do not like to take tests and become frightened or 
anxious when faced by a test. Tests also can demand much time to 
develop, administer, and score. On the other hand, few people are 
reluctant to tell another that they do or do not know something and 
the collection of such information can be done fairly quickly and 
economically. 

The motivation of the respondent and his perception of the rea-
son for which he is being questioned most likely affect his responses,
regardless of their mode. Revealing the extent of one's information
to a professor offering pellets of high grades may be quite different
from responding for a graduate student collecting dissertation data.

¹ The author acknowledges the assistance of Mr. Gary R. Hanson in develop-
ing the test, collecting the data, and analyzing the results.

The author, in an attempt to observe the reported experiences of 
students, both before and after entering college, developed a set of 
experience inventories and asked students to describe their experi-
ences (private music lessons, attendance at lectures and concerts,
membership in youth organizations, and so on). As part of this ex-
perience inventory, three separate lists of relatively well-known
persons were devised, a list of authors, a list of painters, and a list
of other public figures, including businessmen, politicians, entertain-
ers, and athletes. Students were asked to indicate that they had
never heard of the person, that they had heard of the person but had 
no other experience regarding him, or that they had read a book by 
the person, seen a picture painted by the person, or knew who the 
person was. The responses of students in different colleges were 
different, and the responses of students before and after two years 
of college were different. Thus the method apparently reflected dif- 
ferential experiences. 

The question still remained, however, as to the correspondence
that existed between what students said they knew and what they
actually knew, as shown by more traditional achievement exam-
ination.


Method 


From the experience inventory a list of 12 persons well-known in 
public life, 13 well-known authors, and 14 well-known painters was 
prepared. The names included were: Bishop Pike, Henry Miller, 
John Gardner, Melina Mercouri, Van Cliburn, Thomas Watson, 
James Conant, Shirley Booth, Werner Von Braun, Francis Spellman, 
Stewart Udall, Lorne Green, Albert Camus, Fyodor Dostoyevski, 
Ayn Rand, Henry James, James Baldwin, D. H. Lawrence, Leo Tol-
stoi, William Golding, J. D. Salinger, Jean Paul Sartre, James Joyce,
Francois-Marie Voltaire, James Michener, Edgar-Hilaire Degas,
Henri de Toulouse-Lautrec, Peter Paul Rubens, Grant Wood, Velas-
quez, El Greco, Botticelli, Manet, Andy Warhol, Cezanne, Salvador
Dali, Jackson Pollack, Vincent Van Gogh, and Бор. ; 

For the well-known men, students were asked to respond in one of
three ways: know who he is, have heard of him but cannot identify
him, have never heard of him. The responses for the authors
consisted of: read a book by him, heard of him but have not read a
book by him, have never heard of him. The responses for painters
consisted of: seen a picture by him, heard of him but have not seen
a picture by him, have never heard of him. Students were instructed
to place check marks in the appropriate positions. Before they
completed the checklist, the purpose of the research was described
to them as determining how much students knew of these things.

The achievement examination consisted of 40 items and the stem 
of each item included the names of the well-known persons. Five 
alternatives were presented for each item. Examples are: Bishop 
Pike is known for his experiences with (a) LSD (b) hypnosis (c) 
group meditation (d) speaking with the dead (e) anxiety per- 
ception; Which of these novels was written by Albert Camus? (a) 
Pere Goriot (b) Walden Two (c) One Flew Over the Cuckoo’s Nest 
(d) The Great Gatsby (e) The Stranger; Salvador Dali painted (a)
“The Last Supper” (b) “Young Beggar” (c) “The Steamship” (d)
“View of Toledo” (e) “Titus.”

Total scores were obtained for the checklist and for the test, and 
three additional subscores were obtained for each instrument, one 
based on artists, one based on authors, and one based on public 
figures. 

Two samples were studied. One contained 84 males and 80 females, 
mostly sophomores, who were drawn from the subject pool of the 
second semester of general psychology in the spring quarter of 1969. 
Each of these subjects received two “Course grade points” for par- 
ticipating in the research. The second sample consisted of 17 males 
and 35 females who lived in a coeducational freshman dormitory 
and who volunteered to take these tests in order to receive one dol- 
lar for their effort. 

The instructions for each instrument were printed on the first page. 
The experience checklist was distributed and subjects were in- 
structed to read the sheet of instructions and begin immediately. No 
time limit was set and most subjects finished in 7 to 10 minutes. 
After the checklist was completed and before the test was given to 
the students, they were told that the research was designed to com- 
pare the two methods of observing what students know and that 
after they had completed the test the experimenters would compare
their responses to the checklist to the answers they provided on
the test. Subjects then were told to read the printed test instructions 
and begin immediately. Most subjects required between 15 and 25 
minutes to complete the test. 

Within both samples, analyses were completed separately for men 
and women. The experience checklist was scored by assigning a 
weight of three to the category of knowing who the person was, a 
weight of two to the category of having heard of the person, and a 
weight of one to the category of never having heard of the person. 
The total score for the checklist consisted of the sum of all of these 
items. The score for the achievement test consisted of the number of 
items answered correctly. 
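
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category labels ("know", "heard", "never"), the function names, and the sample responses are hypothetical and do not come from the study; only the weights 3, 2, and 1 and the number-correct rule are taken from the text.

```python
# Weights from the text: 3 for the highest category (knowing who the person
# is), 2 for having heard of the person, 1 for never having heard of him.
# The category labels themselves are hypothetical.
WEIGHTS = {"know": 3, "heard": 2, "never": 1}


def checklist_score(responses):
    """Total checklist score: the sum of the category weights over all items."""
    return sum(WEIGHTS[response] for response in responses)


def achievement_score(answers, key):
    """Achievement test score: the number of items answered correctly."""
    return sum(answer == correct for answer, correct in zip(answers, key))


# Hypothetical responses of one student to four items.
example_checklist = ["know", "heard", "never", "know"]
example_answers = ["d", "e", "a", "b"]
example_key = ["d", "e", "c", "c"]

total_checklist = checklist_score(example_checklist)
total_test = achievement_score(example_answers, example_key)
```

Subscores for artists, authors, and public figures would follow by applying the same two functions to the corresponding subsets of items.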

The number of subjects who checked each category on the check- 
list was determined, the number of subjects who checked each cate- 
gory on the checklist and also provided the correct answer for the 
corresponding item on the achievement test was observed, and the 
per cent of individuals who checked each category and also checked 
the correct answer was determined. The product-moment correlation 
coefficients were computed between the total scores and for the three 
subtest scores for each sample. 

Students’ responses to the test items cannot provide an absolute 
indication of their knowledge or lack of knowledge about the person. 
Subjects were asked to identify the name of a person with one of 
several possible facts about that person and some of the subjects 
might have known much about a person but not known the fact
presented. For example, the alternatives presented for the novelist,
James Michener, included the names of four books that Michener
did not write and the title of a book he did write, Tales of the South
Pacific. A student might not have known that Michener wrote that
particular book, but he may have read another book by him and
have known quite a lot about him. Thus, the test gives only one of
several possible indications as to the students’ knowledge regarding
these people.


Results 


Table 1 presents the intercorrelations between the test scores and
the checklist scores.


TABLE 1
Intercorrelations between Scores on the Test and Scores on the Checklist for the Four
Samples

                                                         Public
                              Artists      Authors       Figures       Total
Sample                 N     (14 items)   (13 items)    (12 items)   (39 items)
CLA males              84       .07          .44           .76          .74
CLA females            80      −.05          .69           .67          .67
Dormitory males        17      −.07          .40           .72          .47
Dormitory females      35      −.08          .30           .40          .65


The correlations between total scores on the test and the checklist
ranged from .47 to .74. The correlations for the three largest samples
were .65 and above. These three correlations for the total scores are
statistically significant beyond the .01 level; the correlation for the
smallest group is significant beyond the .05 level. The correspondence
between the two measures is more than faintly observable.

The subtest correlations are greatest for the public figures and 
nonexistent for the artists. This may be due in part to the relatively 
greater experience students have with names of public figures and 
the little experience they have in the field of fine arts. 

The results in Table 1 suggest that a checklist provides a rough
but acceptable means for determining how much students, as a
group, know about some things but not about others.

The next analysis was of relationships between responses to in- 
dividual items on the checklist and corresponding items on the 
test. For example, of the 80 women in the psychology pool sample, 
43 said they knew who Bishop Pike was, 28 said they had heard of 
him but could not identify him further, and 9 said they had never 
heard of him. Of the 43 who said they knew who he was, 65 per cent 
answered correctly the test item regarding Pike. Of the 28 who said 
they had heard of him, 43 per cent answered the test item correctly. 
Of the 9 who said they had never heard of him, 11 per cent answered 
the item correctly. Chance alone would provide that 20 per cent would
answer the item correctly insofar as there were five alternative
answers. For each of the 39 items, the percentage of students who said
they knew the items and who answered the test item correctly was
determined. Then of the students who indicated they had heard of the
item but had no further experience with it, the percentage who
answered the test item correctly was observed. Finally, of the students
who indicated they did not know the item, the percentage who answered
the test item correctly was determined.
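
The item-level tabulation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study. The category labels and function name are hypothetical, and the counts of correct answers (28, 12, and 1) are assumptions back-calculated from the reported group sizes (43, 28, 9) and percentages (65, 43, 11) for the Bishop Pike item.

```python
from collections import defaultdict


def percent_correct_by_category(records):
    """Percentage answering the test item correctly within each checklist category.

    `records` holds one (category, answered_correctly) pair per student.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [number checked, number correct]
    for category, correct in records:
        counts[category][0] += 1
        counts[category][1] += int(correct)
    return {cat: 100.0 * right / n for cat, (n, right) in counts.items()}


# Hypothetical records reproducing the group sizes reported for Bishop Pike.
records = (
    [("know", True)] * 28 + [("know", False)] * 15
    + [("heard", True)] * 12 + [("heard", False)] * 16
    + [("never", True)] * 1 + [("never", False)] * 8
)

pct = percent_correct_by_category(records)

# With five alternatives per item, blind guessing yields 20 per cent correct.
chance = 100.0 / 5
```

Comparing each category's percentage with `chance` shows at a glance whether a group performed better than guessing.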

This analysis first was done for the 12 items regarding public
figures. For the 80 women, the percentages indicating on the checklist
that they knew who these people were ranged from 10 to 76, with
Lorne Green being the best known person. The percentages
indicating they had heard of these people but knew no more about them
ranged from 1 to 28. The percentages indicating they had never heard
of these people ranged from 2 to 53. Conant was the least known
person. For this group of women, and considering only the items
pertaining to public figures, the median percentage correct on the
test was 56. Of the students who checked they knew who these
people were, the median percentage correct on the test was 85. For
the people who checked that they were acquainted with the names of
these people but could not identify them, the median percentage
correct on the test was 53. For the people who said they did not know
who they were, the median percentage correct was 18.

For the authors’ item, the median percentage correct on the test 
was 06. Considering only those persons who said they had read books 
by these authors, the median percentage correct on the test was 74. 
For those who indicated they knew the persons but had not read books 
by them, the median percentage correct on the test was 68. For the 
subjects who said they did not know who these authors were, the 
median percentage correct was 20.

For the artists, the median percentage correct for the total group 
was 39. For the subjects who said they had seen pictures by these 
artists, the median percentage correct was 52. For the subjects who 
recognized the names but had not seen pictures by these persons, the
median percentage correct was 23, and for the subjects who said
they did not recognize the names of the artists, the median
percentage correct was 22.

Thus, for the subjects who described their knowledge as nil, the
median percentage correct on the tests was at the chance level. For the
most part, people who said they knew more about the person also
performed better on corresponding items.

Further evidence regarding the validity of the method can be
obtained from the original experience inventory. Among the list of
names of thirty authors were included the names of [...] persons who
were, as far as this writer knows, nonexistent, and among the listed
names of forty artists were included the names of four persons who
were not known as artists. Among a group of entering freshman men,
from 0 to 3 per cent reported they had read books by the nonexisting
authors and from 5 to 25 per cent indicated they had heard of these
persons. From 71 to 93 per cent reported they had never heard of these
persons; the remainder of the responses were unclassifiable. From 0 to
8 per cent of the students indicated they had seen pictures by the
nonexistent painters and from 4 to 19 per cent reported they had heard
of these. From 72 to 96 per cent reported they had never heard of these
nonexisting painters. Insofar as the students were under no external
pressure to lie, the responses indicating experiences with the
nonexistent persons might be attributed to erroneous associations with
the names of existing persons. For example, one of the nonexisting
author names was Samuel Green. Students might have associated this with
someone like Samuel Butler or Graham Green. One of the names
presented was Gilbert Deck, a name that apparently does not resemble
that of any well-known author, for only 1 per cent of the students
reported they had read a book by him and only an additional 5
per cent indicated they had actually heard of him. Similarly, the
name of Paul Fondley, presented as an artist, was recognized by
only 4 per cent of the students and none of the students claimed they
had seen a picture by him. These sketchy results suggest that not
much purposeful distortion was reflected by the experience inventory.


Conclusion 


The results suggest that for survey purposes, asking people
whether or not they possess information may provide a satisfactory
means for observing whether or not they know something. The
effectiveness of the checklist method may be quite dependent on the
content and also on the level of familiarity the subjects have with
the content.

The results also suggest that if a person on a checklist indicates he
does not know something, his lack of information will be verified
by an achievement test. The checklist method may not be quite
as adequate in showing whether or not people who think they know
something actually do know it. In a sense, we have here an excellent
method for determining the extent of a person's ignorance, perhaps
a less satisfactory method for determining the extent of his
knowledge.


This conclusion has practical significance. Surveys of the amount of
knowledge and information possessed by members of a group
can be simplified by successive screening. Large numbers of persons
can be asked to indicate on a checklist what they know and then
achievement tests can be given only to those claiming knowledge. The
assumption is that those who say they do not know really are
ignorant and further testing of them is unnecessary. How acceptable this
assumption is depends in part on the motivation of respondents.


A SIGNIFICANCE TEST FOR BISERIAL r¹


EDWARD ALF² AND NORMAN ABRAHAMS


Naval Personnel and Training Research Laboratory 
San Diego, California 92152 


THE biserial correlation coefficient, $r_b$, is a statistic often used in
test construction and validation. A disadvantage of $r_b$ is that its exact
sampling distribution in small samples is not known. Thus
significance testing becomes a problem, and approximate methods suitable
only for large samples or limited ranges of difficulty level or criterion
dichotomization ($p$) and magnitude of $r_b$ must be used (see, e.g.,
Guilford, 1965, p. 319; Walker and Lev, 1953, p. 271). An advantage
of the point-biserial correlation coefficient, $r_{pb}$, is that its exact
sampling distribution is known.

The present paper will show that under the null hypothesis $\rho = 0$,
the assumptions of $r_b$ and $r_{pb}$ coincide. This makes it possible to
derive exact, small sample significance tests for $r_b$ from the
corresponding known distribution of $r_{pb}$.

The point-biserial correlation coefficient is used to determine the
relation between a dichotomous variable, X, and a continuous
variable, Y. The development of $r_{pb}$ assumes the continuous variable,
Y, is normally distributed in each X category, and the variance of
Y is the same in each X category. No assumption is made as to the
distribution of the dichotomous variable, but generalization is made
only to a universe of samples of size N having the same fixed
number of cases Np and Nq in the dichotomous categories (Walker and
Lev, 1953, p. 271).


¹ The opinions expressed are those of the authors and do not necessarily
reflect those of the Navy Department.

² Also at San Diego State College.


For example, if we compute $r_{pb}$ on random samples of size 10 from
some given population, where Np = 6 cases fall in one X category
and Nq = 4 cases fall in the other X category, we would expect $r_{pb}$
(df = 8) to exceed ±.632 one time in twenty when $\rho = 0$. The value
of .632 can be found in a table of significance values for $r_{pb}$ (see, e.g.,
Guilford, 1965, p. 580).
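
The tabled value of .632 can be checked with a short sketch. It assumes the standard relation between a product-moment correlation and Student's t under the null hypothesis, $r = t/\sqrt{t^2 + df}$, and takes 2.306 as the two-tailed five per cent critical value of t for 8 degrees of freedom. Neither the relation nor the t value is stated in the text; both are supplied here as assumptions, and the function name is hypothetical.

```python
import math


def critical_r(t_crit, df):
    """Critical correlation implied by a critical t value on df degrees of freedom."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# 2.306 is assumed to be the two-tailed 5 per cent point of t with 8 df.
r_crit = critical_r(2.306, 8)
```

Rounded to three decimals, `r_crit` agrees with the value of .632 quoted above.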

The biserial correlation assumes we are sampling from a
bivariate normal population, and the two X categories are formed by
a dichotomization of the X continuum. It is often true that the
distribution of Y will not be the same in each X category under the
assumptions of $r_b$ (Walker and Lev, 1953, p. 271). However, when
$\rho = 0$, the distribution of Y will be independent of X. Therefore,
the distribution of Y will be identical in each X category; and if the
(X, Y) distribution is bivariate normal, then Y will of necessity
be normally distributed in each X category, and will have the same
variance in each X category. Thus, when $\rho = 0$, the assumptions of
$r_b$ and $r_{pb}$ are the same.

For any given values of N, p and q, $r_b$ will be a constant times
$r_{pb}$; that is:

$$r_b = \frac{\sqrt{pq}}{y}\, r_{pb},$$

where y is the ordinate of the normal curve at the point of
dichotomization. Therefore, it is possible to determine significant values
for $r_b$ directly from the significant values of the corresponding $r_{pb}$.

In the above example, for instance, where p = .6, we have:

$$r_b = r_{pb}\,\frac{\sqrt{(.6)(.4)}}{.3863} = r_{pb}\,(1.268).$$

Thus, if $r_{pb} = \pm .632$ is significant at the five per cent level, then
$r_b = \pm(.632)(1.268) = \pm .801$ will also be significant at the five
per cent level.
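
The conversion from a significant $r_{pb}$ to a significant $r_b$ can be sketched with Python's standard library. The function names are hypothetical, and the check for values above 1.00 follows the convention of the note to Table 1; this is an illustration of the formula above, not the authors' own computing procedure.

```python
from math import sqrt
from statistics import NormalDist


def biserial_factor(p):
    """Constant sqrt(pq)/y that converts r_pb into r_b for a split at proportion p."""
    q = 1.0 - p
    z = NormalDist().inv_cdf(p)   # point of dichotomization on the normal curve
    y = NormalDist().pdf(z)       # ordinate of the normal curve at that point
    return sqrt(p * q) / y


def critical_r_biserial(r_pb_crit, p):
    """Critical r_b, or None when r_b would have to exceed 1.00 to be significant."""
    r_b = r_pb_crit * biserial_factor(p)
    return r_b if r_b <= 1.0 else None


factor = biserial_factor(0.6)                  # about 1.268
r_b_crit = critical_r_biserial(0.632, 0.6)     # about .801
```

Because the factor grows as the split moves away from .50, small samples with extreme dichotomizations return `None`, which corresponds to the dashes in the table.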


TABLE 1
The 5 (Roman Type) and 1 (Italics) Per Cent Significance Levels for $r_b$

Columns: point of dichotomization p (q) = .50 (.50), .55 (.45), .60 (.40),
.65 (.35), .70 (.30), .75 (.25), .80 (.20), .85 (.15), .90 (.10), .95 (.05).
Rows: sample size N.

[Table entries are not legible in the source.]

Note.—A dash (—) indicates $r_b$ would have to exceed 1.00 to be significant.
STATISTICAL CONTROL OF "IMPURITY" IN
THE ESTIMATION OF TEST RELIABILITY 


K. H. LU


Department of Biostatistics 
University of Oregon Dental School 
Portland, Oregon 


TRADITIONALLY, the reliability of a test is defined as the ratio of 
the variance of true test scores to the variance of the observed test 
scores. Numerous articles have been written on the concept of
reliability and the ways of estimation. The three typical techniques 
of estimation are (1) the Kuder-Richardson formula 20 (Kuder and 
Richardson, 1937), (2) the Hoyt analysis of variance method (Hoyt, 
1941), and (3) the Rulon split half method (Rulon, 1939). It has 
been shown that the Kuder-Richardson formula 20 and Hoyt's 
method are algebraic equivalents. The reliabilities of many tests 
have been computed by these methods since their introduction some 
30 years ago. 

Unfortunately, from the theoretical point of view, each of these 
methods suffers from a certain amount of impurity. Consequently, 
these methods arrive at the correct estimates only when the im- 
purities are accidentally absent from the data. For instance, Hoyt 
and Krishnaiah (1960) investigated an analysis of variance model 
where the item-subject interaction was assumed absent. Under vari- 
ous assumptions regarding the nature of the effects of the items and 
the subjects, they found some impurity exists in this simple model. 

It is the purpose of this paper to:

1. render precise definitions of the concepts of reliability for the
single item, the whole test, and the relationship between them
in the least-squares sense;

2. elucidate the sources of impurities present in the current
methods of estimation;

3. suggest an estimating procedure such that the resultant
reliabilities are free from such impurities; and

4. give the statistical definition of significant reliability which
serves as the necessary condition for a "meaningful"
reliability.


The Definitions of Item and Test Reliabilities 


Suppose that the observed score of test item j by a subject $S_i$ is
given by

$$y_{ij} = \mu + S_i + e_{ij}, \qquad i = 1, 2, \ldots, m; \quad j = 1, 2, \ldots, n \qquad (1)$$

where $\mu$ = the mean of true scores
$S_i$ = the true deviation from $\mu$ for the ith subject, such that
$E(S_i) = 0$ and $E(S_i^2) = \sigma_s^2$
$e_{ij}$ = an observational random error which is normally
independently distributed $(0, \sigma^2)$.

We see that the true score of the ith subject is $\mu + S_i$. We shall
define the reliability per item as the intraclass correlation

$$r = \frac{\sigma_s^2}{\sigma_s^2 + \sigma^2}. \qquad (2)$$
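Now let us consider the problem of the definition of reliability
for a test of n items.
Let

$$\bar{y}_{i.} = \frac{1}{n}\sum_{j=1}^{n} y_{ij} = \frac{1}{n}\Big[n\mu + nS_i + \sum_{j=1}^{n} e_{ij}\Big];$$

we have

$$\bar{y}_{i.} = \mu + S_i + \bar{e}_{i.}. \qquad (3)$$

Suppose that $\mu$ is known or effectively estimable; we wish to
find a coefficient $r_n$ such that

$$\Phi = E\big[r_n(\bar{y}_{i.} - \mu) - S_i\big]^2$$

is a minimum.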
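Expanding $\Phi$, we have

$$\Phi = r_n^2\,E(\bar{y}_{i.} - \mu)^2 - 2r_n\,E\big[(\bar{y}_{i.} - \mu)(S_i)\big] + E(S_i^2)
= r_n^2\Big(\sigma_s^2 + \frac{\sigma^2}{n}\Big) - 2r_n\sigma_s^2 + \sigma_s^2.$$

Differentiating $\Phi$ with respect to $r_n$ and setting the derivative equal to zero,
we have

$$2r_n\Big(\sigma_s^2 + \frac{\sigma^2}{n}\Big) - 2\sigma_s^2 = 0$$

and then

$$r_n = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma^2}{n}} = \frac{n\sigma_s^2}{n\sigma_s^2 + \sigma^2}. \qquad (4)$$

We shall call the $r_n$ thus derived the reliability of the test.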

It is of interest to note that $r_n$ is the resultant reliability of
applying the Spearman-Brown formula to the reliability per item,
$r$. It can be shown as follows:

From equation (2), we have

$$\frac{1}{r} = \frac{\sigma_s^2 + \sigma^2}{\sigma_s^2} = 1 + \frac{\sigma^2}{\sigma_s^2}.$$

Therefore,

$$\frac{\sigma_s^2}{\sigma^2} = \frac{r}{1 - r}.$$

From (4), we have

$$r_n = \frac{n\sigma_s^2}{n\sigma_s^2 + \sigma^2} = \frac{n(\sigma_s^2/\sigma^2)}{n(\sigma_s^2/\sigma^2) + 1}
= \frac{nr/(1 - r)}{nr/(1 - r) + 1} = \frac{nr}{1 + (n - 1)r},$$

which is the familiar Spearman-Brown formula (Spearman, 1904).
The test reliability as defined by equation (4) is totally
consistent with the definition given by previous investigators,

$$r_n = \frac{\text{Var (true test score)}}{\text{Var (observed test score)}}. \qquad (5)$$
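
The equivalence between equation (4) and the Spearman-Brown formula can be checked numerically with a small sketch. The function names and the variance components used below are hypothetical; only the three formulas are taken from the derivation above.

```python
def item_reliability(var_s, var_e):
    """Equation (2): reliability per item, the intraclass correlation."""
    return var_s / (var_s + var_e)


def reliability_of_test(var_s, var_e, n):
    """Equation (4): reliability of a test of n items."""
    return n * var_s / (n * var_s + var_e)


def spearman_brown(r, n):
    """Spearman-Brown formula applied to the per-item reliability r."""
    return n * r / (1 + (n - 1) * r)


# Hypothetical variance components: sigma_s^2 = 1, sigma^2 = 4, n = 12 items.
r_item = item_reliability(1.0, 4.0)            # 0.2
r_test = reliability_of_test(1.0, 4.0, 12)     # 12 / 16 = 0.75
r_sb = spearman_brown(r_item, 12)              # 2.4 / 3.2 = 0.75
```

Both routes give the same value, as the algebra requires for any positive choice of the two variances.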
The Impurity in the Current Estimation Methods


Unfortunately, the computation procedures used in computing
$r_n$, such as the Kuder-Richardson 20, or its equivalents, the Hoyt
analysis of variance method and the Rulon split-half method, are
all inappropriate to some extent, so that they seldom truly approach
what they should. In the narrative to follow, we shall discuss the
inappropriateness of these computation procedures.

Let us consider a test of n items given to m subjects and the
results summarized as a two-factor factorial table, where the subjects
and the items have m and n levels respectively.


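TABLE 1
Results of an n-item Test Taken by m Subjects ($x_{ij}$ = score of jth item by ith subject)

                               Items
Subjects     1          2          ...    j          ...    n          Total
1            $x_{11}$   $x_{12}$   ...    $x_{1j}$   ...    $x_{1n}$   $x_{1.}$
2            $x_{21}$   $x_{22}$   ...    $x_{2j}$   ...    $x_{2n}$   $x_{2.}$
...
i            $x_{i1}$   $x_{i2}$   ...    $x_{ij}$   ...    $x_{in}$   $x_{i.}$
...
m            $x_{m1}$   $x_{m2}$   ...    $x_{mj}$   ...    $x_{mn}$   $x_{m.}$
Total        $x_{.1}$   $x_{.2}$   ...    $x_{.j}$   ...    $x_{.n}$   $x_{..}$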


The Kuder-Richardson 20 and the Hoyt method may be illustrated
by the following mathematical model in analysis of variance:

$$x_{ij} = \mu + S_i + I_j + e_{ij} \qquad (6)$$

where
$x_{ij}$ = the score of the ith subject on the jth item
= 1 if answered correctly
= 0 otherwise,
$\mu$ = an effect common to all $x_{ij}$'s
$S_i$ = a component specifically associated with the ith subject
$I_j$ = a component specifically associated with the jth item

[...] fixed effect model [...] this model, it is understood that only
the particular item [...]

This model assumes that (1) the items of the test are a random
sample of size n from a population of items; (2) the subjects are a
random sample of size m from a population of subjects. The results
are to be inferred to the populations of items and subjects at large.
While it is reasonable to assume the m subjects are a random sample,
seldom if ever is a test written which would select test items
[...] and over again with various groups of subjects, it is then anything
but a random sample. The random effect model's assumption that
the items are a random sample, therefore, places a very strict
restriction on the usefulness of the random effect model. However, on
occasions where the items can indeed be considered as a random
sample, the random effect model of course can be used.


In view of the foregoing discussion, a mixed-effect model would
appear more suitable for our purpose. By a mixed-effect model, it is
meant that: (1) the items are considered fixed, and (2) the subjects
are a random sample from a population of subjects. However,
difficulties in the correct estimation of reliability remain. This is
due to the fact that the test results are considered as a two-factor
(item and subject) factorial without direct estimate of the
experimental error $\sigma^2$. In the analysis to follow, we shall show the
consequences in each of these three models, resulting in impurities in
the estimation of reliability.
TABLE 2
The Components of Variance for the Fixed, the Random and the Mixed Models

                                               M.S. is Estimate of
Variation   d.f.            M.S.     Fixed Model                      Random Model                                   Mixed Model
Items       n − 1           $M_1$    $\sigma^2 + m\sigma_I^2$         $\sigma^2 + \sigma_{IS}^2 + m\sigma_I^2$       $\sigma^2 + \sigma_{IS}^2 + m\sigma_I^2$
Subjects    m − 1           $M_2$    $\sigma^2 + n\sigma_s^2$         $\sigma^2 + \sigma_{IS}^2 + n\sigma_s^2$       $\sigma^2 + n\sigma_s^2$
I × S       (n − 1)(m − 1)  $M_3$    $\sigma^2 + \sigma_{IS}^2$       $\sigma^2 + \sigma_{IS}^2$                     $\sigma^2 + \sigma_{IS}^2$
Error       0               $M_4$    $\sigma^2$                       $\sigma^2$                                     $\sigma^2$
Total       mn − 1


The partitions of mean squares of the analysis of variance
according to the three models are given in Table 2. We also see that
the error mean square is not estimable because it has zero degrees of
freedom. According to Hoyt's formula, the reliability is calculated
as

$$r_n = \frac{M_2 - M_3}{M_2}.$$

Thus for the fixed and mixed models,

$$r_n = \frac{n\sigma_s^2 - \sigma_{IS}^2}{n\sigma_s^2 + \sigma^2},$$

and for the random model,

$$r_n = \frac{n\sigma_s^2}{n\sigma_s^2 + \sigma_{IS}^2 + \sigma^2}.$$

Unless $\sigma_{IS}^2 = 0$, the Kuder-Richardson formula 20 or Hoyt's
methods will always under-estimate $r_n$.
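
The algebraic equivalence of Hoyt's method and the Kuder-Richardson formula 20 can be illustrated with a brief sketch. The score matrix below is hypothetical, the function names are illustrative, and the Kuder-Richardson computation assumes the variance of total scores is taken with divisor m, which is the form under which the two results coincide exactly.

```python
def hoyt_reliability(scores):
    """Hoyt's estimate (M_subjects - M_residual) / M_subjects from a 0/1 score matrix."""
    m, n = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores)
    correction = grand ** 2 / (m * n)
    ss_total = sum(x * x for row in scores for x in row) - correction
    ss_subjects = sum(sum(row) ** 2 for row in scores) / n - correction
    ss_items = sum(sum(row[j] for row in scores) ** 2 for j in range(n)) / m - correction
    ss_residual = ss_total - ss_subjects - ss_items   # item-by-subject interaction
    ms_subjects = ss_subjects / (m - 1)
    ms_residual = ss_residual / ((m - 1) * (n - 1))
    return (ms_subjects - ms_residual) / ms_subjects


def kr20(scores):
    """Kuder-Richardson formula 20, with total-score variance using divisor m."""
    m, n = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / m
    var_total = sum((t - mean_total) ** 2 for t in totals) / m
    p = [sum(row[j] for row in scores) / m for j in range(n)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return n / (n - 1) * (1 - sum_pq / var_total)


# Hypothetical scores of 5 subjects on 4 items (1 = correct, 0 = incorrect).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
r_hoyt = hoyt_reliability(example)
r_kr20 = kr20(example)
```

The residual mean square in `hoyt_reliability` plays the role of the I × S line of Table 2, since no separate error line is available with one observation per cell.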


Rulon's formula is equivalent to the following analysis of variance
table involving only the subjects' scores on the halves.


TABLE 3
The Components of Variance for Rulon’s Split-Halves Model

Sources of
Variation    d.f.     M.S.     M.S. is Estimate of
Halves       1        $M_1$    $\sigma^2 + \frac{n}{2}\sigma_{HS}^2 + \frac{mn}{2}\sum_k H_k^2$
Subjects     m − 1    $M_2$    $\sigma^2 + n\sigma_s^2$
H × S        m − 1    $M_3$    $\sigma^2 + \frac{n}{2}\sigma_{HS}^2$


Note that there are (m × n) observations, hence mn − 1 degrees of freedom
for analysis. But Rulon's method utilizes only 2m − 1 of them.
From Table 3, we have

$$r_n = \frac{M_2 - M_3}{M_2} = \frac{n\sigma_s^2 - \frac{n}{2}\sigma_{HS}^2}{\sigma^2 + n\sigma_s^2}.$$

Again it suffers from a theoretical “impurity” $\sigma_{HS}^2$ which tends to
estimate the reliability incorrectly.

In the event that $\sigma_{HS}^2 = 0$, $M_3$ is in fact an estimate of $\sigma^2$,
but it is estimated with only (m − 1) degrees of freedom, which may
sometimes result in overestimating or underestimating the reliability;
whereas from the available data, a more efficient estimate of the error
mean square, with m(n/2 − 1) degrees of freedom, is available but not
utilized. This inefficient use of available data should be avoided.

From the above analysis, it is abundantly clear that the mixed effect
model would serve our purpose if it can be made to provide an
estimate of the experimental error $\sigma^2$. In the narrative to follow we
shall show such a model.


The Appropriate Design and Model for the Estimation of Reliability

In view of the foregoing analysis, it becomes apparent that in 
order to obtain the proper estimate of reliability, two salient points 
must be taken into consideration for the construction of a mathe- 
matical model: (1) the inclusion of the item-subject interaction 
term and (2) the direct estimation of $\sigma^2$ and subsequent estimation
of $\sigma_s^2$. Guttman (1945) demonstrated that in order to estimate the
reliability coefficient, at least two tests are needed, thus leading to
the procedure of the test-retest case. In our present case, we seek a
method of estimation where only one test is required. In order to
satisfy these requirements, one must in some way partition the test
into at least two “comparable” parts in order to provide an
independent estimate of $\sigma^2$. There are various ways of partitioning the test
into two parts; for example, it may be done by item contents or
item difficulty. While it is difficult to pair items by “comparable”
contents, it is a simple matter to pair them by difficulty from the
results of the test, since the error variance $\sigma^2$ is a function of item
difficulty (Lord, 1957). In fact, if the partition is done on the
difficulty basis, it results in a reduction of the size of error variance $\sigma^2$
(Osborn, 1969). Let a sample of m subjects be tested by a test of n
different items (we should require that n be an even number), in
a manner such that if the item is answered correctly, the subject
receives one point, and if answered incorrectly, the subject receives 
nothing. We shall arrange the n items in a descending order accord- 
ing to their respective numbers of correct answers, the item with the 
largest number of correct answers is listed first, the item with the 
second largest number of correct answers next, and the item with 
the least number of correct answers last. The list thus obtained is 
then a list of the n items by degree of difficulty in an ascending 
order. 

Since the degree of difficulty is monotonic in nature as appeared in 
the list, we may systematically assign all the odd-numbered items to 
the first half and all the even-numbered items to the second half of 
the test. Thus, we have n/2 pairs of items with one member of 
each pair in each half of the test. It appears reasonable to view the 
halves thus obtained as being comparable in the sense that each 
item in one half has a counterpart in the other with comparable 
degree of difficulty. (The systematic assignment will give the odd 
half perhaps a greater mean than the even half, but would not affect 
the variances of the halves.) The test results can be considered as 
being stratified in pairs according to item difficulty. As we shall see 
later, this stratification enables us to obtain an estimate of reli- 
ability without the entanglement present in the current methods. 
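
The ranking and odd-even assignment described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the small score matrix are hypothetical, items are referred to by their 0-based column index, and ties in the number of correct answers are broken by original item order, a detail the text does not specify.

```python
def split_by_difficulty(scores):
    """Rank items from easiest to hardest and assign them alternately to two halves.

    `scores` is a list of rows (one per subject) of 0/1 item scores.
    Returns (first_half, second_half, pairs) as lists of 0-based item indices.
    """
    n = len(scores[0])
    if n % 2 != 0:
        raise ValueError("the number of items n must be even")
    n_correct = [sum(row[j] for row in scores) for j in range(n)]
    # Descending number of correct answers = ascending difficulty.
    order = sorted(range(n), key=lambda j: -n_correct[j])
    first_half = order[0::2]    # odd-numbered positions in the ranked list
    second_half = order[1::2]   # even-numbered positions in the ranked list
    pairs = list(zip(first_half, second_half))
    return first_half, second_half, pairs


# Hypothetical scores of 3 subjects on 4 items.
example_scores = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
first_half, second_half, pairs = split_by_difficulty(example_scores)
```

Each tuple in `pairs` holds two items of adjacent difficulty, one in each half, which is the stratification the model below relies on.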

Let the model of the test be defined as follows:

$$y_{ijk} = \mu + H_k + S_i + (HS)_{ik} + I_j + (SI)_{ij} + e_{ijk}$$

where i = 1, 2, ..., m; j = 1, 2, ..., n/2; k = 1, 2, and

$y_{ijk}$ = the score of the ith subject on the jth item in the kth half
= 1 if answered correctly
= 0 otherwise

$\mu$ = a common effect shared by all $y_{ijk}$'s

$H_k$ = a specific effect associated with the kth half; the halves
are fixed effects as a result of systematic designation,
such that $\sum_k H_k = 0$.

$S_i$ = a component specifically associated with the ith subject.
The $S_i$'s are assumed normally independently distributed
with mean zero and variance $\sigma_s^2$.

$I_j$ = a specific effect associated with the jth item pair. The
$I_j$'s are purposefully chosen items, therefore considered
fixed, and $\sum_j I_j = 0$.

$(HS)_{ik}$ = a component associated with the specific combination of
the ith subject and the kth half such that $\sum_k (HS)_{ik} = 0$.
Note that we do not require $\sum_i (HS)_{ik} = 0$.

$(SI)_{ij}$ = a component associated with the specific combination of
the ith subject and the jth item pair, such that $\sum_j (SI)_{ij} = 0$;
again we do not require $\sum_i (SI)_{ij} = 0$.

$e_{ijk}$ = a random error normally and independently distributed
with mean zero and variance $\sigma^2$.


The analysis of variance and the components of the mean squares
are listed in Table 4. The test of significance of reliability is the
test of

$H_0$: $\sigma_s^2 = 0$ by
$F = M_2/M_6$ with (m − 1) and m(n/2 − 1) degrees of freedom.


TABLE 4
The Components of Variance for Mixed Model with Items Stratified According to
Difficulty

Sources of                                    Mean Square is
Variation     d.f.                  M.S.      Estimate of:
Halves        1                     $M_1$     $\sigma^2 + \frac{n}{2}\sigma_{HS}^2 + \frac{mn}{2}\sum_k H_k^2$
Subjects      m − 1                 $M_2$     $\sigma^2 + n\sigma_s^2$
H × S         m − 1                 $M_3$     $\sigma^2 + \frac{n}{2}\sigma_{HS}^2$
Item-pairs    n/2 − 1               $M_4$     $\sigma^2 + 2\sigma_{SI}^2 + \frac{2m}{n/2 - 1}\sum_j I_j^2$
S × I         (m − 1)(n/2 − 1)      $M_5$     $\sigma^2 + 2\sigma_{SI}^2$
Error         m(n/2 − 1)            $M_6$     $\sigma^2$
Total         mn − 1


The computation of reliability for the test is

$$r_n = \frac{n\sigma_s^2}{n\sigma_s^2 + \sigma^2} = \frac{M_2 - M_6}{M_2},$$

and for the reliability per item is

$$r = \frac{\sigma_s^2}{\sigma_s^2 + \sigma^2} = \frac{M_2 - M_6}{M_2 + M_6(n - 1)}.$$


Significant Reliability vs Meaningful Reliability 
By the definitions of reliability,

$$r_t = \frac{n\sigma_S^2}{n\sigma_S^2 + \sigma^2} \quad \text{and} \quad r_1 = \frac{\sigma_S^2}{\sigma_S^2 + \sigma^2},$$

we see that unless we are reasonably certain that $\sigma_S^2 > 0$, the
reliability must be deemed as essentially zero. In order to assert that
$\sigma_S^2 > 0$, we must rely upon the rejection of the hypothesis that
$\sigma_S^2 = 0$. The F test

$$F = \frac{M_1}{M_5}, \quad \text{which estimates} \quad \frac{\sigma^2 + n\sigma_S^2}{\sigma^2},$$

with $(m - 1)$ and $m(n/2 - 1)$ degrees of freedom would serve this
purpose. We shall at this point define the term significant reliability
($r_t$ or $r_1$) as a reliability estimate based on data analysis where the
F test concerning $H_0: \sigma_S^2 = 0$ has been significant. A
nonsignificant F value would suggest the need of a new test. For a
reliability estimate to be meaningful, the requirement of being
significant would serve as a necessary condition, though just how
high the reliability has to be in order to be meaningful depends
very much on the purpose of the test.
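
The two reliability formulas and the F ratio can be checked numerically. The following is a minimal sketch, not the author's own procedure: the function name is hypothetical, and the inputs are the subject and error mean squares reported for the 12-item numerical example in Table 7 (.8750 and .1083).

```python
# Sketch: reliability estimates from the mixed-model mean squares.
# ms_subjects plays the role of M_1 and ms_error the role of M_5;
# the function name is an illustrative assumption.

def mixed_model_reliability(ms_subjects, ms_error, n_items):
    """Return (F ratio, test reliability r_t, per-item reliability r_1)."""
    f_ratio = ms_subjects / ms_error
    r_test = (ms_subjects - ms_error) / ms_subjects
    r_item = (ms_subjects - ms_error) / (ms_subjects + (n_items - 1) * ms_error)
    return f_ratio, r_test, r_item


# Values taken from the numerical example (Table 7): M_1 = .8750, M_5 = .1083, n = 12.
f_ratio, r_test, r_item = mixed_model_reliability(0.8750, 0.1083, 12)
print(f"F = {f_ratio:.2f}, r_t = {r_test:.4f}, r_1 = {r_item:.4f}")
```

The F ratio would still have to be referred to an F table with $(m - 1)$ and $m(n/2 - 1)$ degrees of freedom; the sketch only does the arithmetic.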


A Numerical Example 


In Table 5 are the item and subject scores of a 12-item test taken by 10
subjects, from Guilford’s book Psychometric Methods (Guilford,
1954).


The above example is chosen for its wide accessibility since the
Guilford book is a standard reference for workers in this area.

(1) Using Kuder-Richardson formula 20, we find

$$\sum pq = 2.03, \quad \sigma_t^2 = 9.45, \quad n = 12$$

$$r_t = \frac{n}{n - 1}\left(\frac{\sigma_t^2 - \sum pq}{\sigma_t^2}\right) = \frac{12}{11}\left(\frac{9.45 - 2.03}{9.45}\right) = .857.$$

TABLE 5
Item Scores and Total Scores of a 12-Item Test by 10 Subjects
[Table printed rotated in the source; the individual cell values are not recoverable from the scan. Legible margins: $\sum X_t = 65$, $\sum X_t^2 = 517$.]

(2) Using Hoyt's analysis of variance method, we have

Sources     d.f.    S.S.      M.S.
Subjects      9     7.875     .875
Items        11     9.492     .863
Error        99    12.425     .126
Total       119    29.792

$$r_t = \frac{.875 - .126}{.875} = .857.$$
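
The first two estimates can be reproduced from the summary figures quoted above. This is a hedged sketch: the function names are hypothetical, and Hoyt's coefficient is computed from the unrounded mean squares (sum of squares divided by degrees of freedom), which is why it agrees with Kuder-Richardson formula 20 to three decimals.

```python
# Sketch: Kuder-Richardson 20 and Hoyt's ANOVA reliability from summary statistics.
# Function names are illustrative assumptions.

def kr20(n_items, sum_pq, total_variance):
    """KR-20 from the number of items, sum of item variances pq, and total-score variance."""
    return n_items / (n_items - 1) * (1 - sum_pq / total_variance)


def hoyt(ss_subjects, df_subjects, ss_error, df_error):
    """Hoyt's coefficient: (MS_subjects - MS_error) / MS_subjects."""
    ms_subjects = ss_subjects / df_subjects
    ms_error = ss_error / df_error
    return (ms_subjects - ms_error) / ms_subjects


r_kr20 = kr20(12, 2.03, 9.45)          # values quoted in step (1)
r_hoyt = hoyt(7.875, 9, 12.425, 99)    # values quoted in step (2)
print(f"KR-20 = {r_kr20:.3f}, Hoyt = {r_hoyt:.3f}")
```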
(3) Using Rulon's split-half method: 


The split of the test items is 1, 3, 5, 7, 9, and 11 as the first
half, and 2, 4, 6, 8, 10, and 12 as the second half, as was done by
Guilford; the computation is shown in Table 6.

From Table 6, we have $\sigma_d^2 = 1.05$, $\sigma_t^2 = 9.45$. Thus, according to
Rulon's formula,

$$r_t = 1 - \frac{\sigma_d^2}{\sigma_t^2} = 1 - \frac{1.05}{9.45} = .889.$$

TABLE 6
Computations for Rulon's Method of Estimating Reliability

            X_o     X_e     (X_o - X_e) = d     d^2     X_o + X_e = X_t     X_t^2
             2       0           +2              4            2                4
             2       2            0              0            4               16
             2       2            0              0            4               16
             2       3           -1              1            5               25
             3       2           +1              1            5               25
             4       2           +2              4            6               36
             4       3           +1              1            7               49
             4       5           -1              1            9               81
             6       5           +1              1           11              121
             6       6            0              0           12              144

Sum         35      30           +5             13           65              517
Sum X^2    145     120
M          3.5     3.0         +0.5                         6.5
\sigma^2  2.25    3.00         1.05                         9.45
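
Rulon's computation can be sketched directly from the half-test scores. The odd and even half scores below are read from Table 6; because the scan is partly garbled, the individual rows are an assumption that was checked against the printed column sums (35 and 30) and sums of squares (145 and 120). Variances use the divisor N, as the table's figures imply.

```python
# Sketch: Rulon's split-half reliability, r = 1 - var(d) / var(total).
# The half scores are a reconstruction from Table 6 (assumption, see lead-in).

def population_variance(values):
    """Variance with divisor N, matching the figures printed in Table 6."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rulon(half_a, half_b):
    differences = [a - b for a, b in zip(half_a, half_b)]
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 1 - population_variance(differences) / population_variance(totals)


odd_half = [2, 2, 2, 2, 3, 4, 4, 4, 6, 6]    # items 1, 3, 5, 7, 9, 11
even_half = [0, 2, 2, 3, 2, 2, 3, 5, 5, 6]   # items 2, 4, 6, 8, 10, 12
r_rulon = rulon(odd_half, even_half)
print(f"Rulon r_t = {r_rulon:.4f}")
```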

The analysis of variance is presented in Table 7. We first test the
hypothesis $H_0: \sigma_{HS}^2 = 0$.

TABLE 7
Analysis of Variance and Components of Variance of the Mixed Model with
Stratification According to Item Difficulty

Sources of                                          Mean Square is
Variation     d.f.     S.S.        M.S.             Estimate of
Halves          1       .2077     (M_0)   .2077
Subjects        9      7.8750     (M_1)   .8750     \sigma^2 + 12\sigma_S^2
H·S             9       .8750     (M_2)   .0972     \sigma^2 + 6\sigma_{HS}^2
I               5      9.1417     (M_3)  1.8283
I·S            45      6.2750     (M_4)   .1394
Error          50      5.4173     (M_5)   .1083     \sigma^2
Total         119     29.7917

We accept $H_0: \sigma_{HS}^2 = 0$. Note that if we compute

$$\frac{M_1 - M_2}{M_1} = \frac{.8750 - .0972}{.8750} = .8889,$$

the result is identical to that of Rulon's formula. In the present case,
based on 9 degrees of freedom and the acceptance of $\sigma_{HS}^2 = 0$, the
estimate of $\sigma^2$ is .0972. However, the most efficient estimate of
$\sigma^2$ is the error mean square, $\hat{\sigma}^2 = .1083$, based on 50
degrees of freedom. Thus we estimate

$$r_t = \frac{M_1 - M_5}{M_1} = \frac{.8750 - .1083}{.8750} = .8762$$

and also

$$\hat{\sigma}_S^2 = \frac{.8750 - .1083}{12} = .0639.$$

Thus reliability per item is

$$r_1 = \frac{.0639}{.0639 + .1083} = .371.$$
We now list the results for comparison purposes:

Methods                  r_t
Kuder-Richardson 20      .8570
Hoyt's                   .8570
Rulon's                  .8889
Mixed-effect model       .8762


From the above demonstration, we see that $r_t$ is underestimated by
the Kuder-Richardson 20 and the Hoyt method. The overestimation
of $r_t$ by Rulon's method is due to the small degrees of freedom in the
estimation of $\sigma^2$. The mixed-effect model estimate of $r_t$ thus
calculated is free from these impurities.

One of the applications of $r_t$, aside from its meaning as a measure
of reliability, is its use in estimating individual subjects' true scores.

$$T_i = \mu + S_i$$

$$\hat{T}_i = \bar{y} + r_t\,(y_i - \bar{y})$$

Therefore,

$$\hat{T}_i = (1 - r_t)\,\bar{y} + r_t\,y_i.$$

For example, with $\bar{y} = 6.50$ and $r_t = .8762$, the estimated true
scores of the ten subjects are: 2.56, 4.31, 4.31, 5.19, 5.19, 6.06, 6.94,
8.69, 10.44 and 11.32.
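
The true-score estimates listed above follow from shrinking each observed total toward the group mean by the reliability. A minimal sketch, assuming the ten observed totals are 2, 4, 4, 5, 5, 6, 7, 9, 11, and 12 (inferred from the totals column of Table 6, not stated in this passage); the function name is hypothetical.

```python
# Sketch: estimated true score = mean + reliability * (observed - mean).
# The observed totals are an assumption inferred from Table 6.

def estimated_true_scores(observed, reliability):
    mean = sum(observed) / len(observed)
    return [mean + reliability * (score - mean) for score in observed]


observed_totals = [2, 4, 4, 5, 5, 6, 7, 9, 11, 12]
estimates = estimated_true_scores(observed_totals, 0.8762)
print([round(value, 2) for value in estimates])
```

Because the reliability is below 1, every estimate lies between the observed score and the mean of 6.50, with the extreme scores pulled in the most.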
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IS THERE AN OPTIMAL NUMBER OF ALTERNATIVES 
FOR LIKERT SCALE ITEMS? STUDY I: RELIABILITY 
AND VALIDITY 


MICHAEL S. MATELL AND JACOB JACOBY
Purdue University 


Given that rating scales are so widely used in the social sciences, 
both as research tools and in practical applications, determination 
of the optimal number of rating categories becomes an important 
consideration in the construction of such scales. As Garner (1960) 
pointed out, the basic question is whether for any given rating in- 
strument there is an optimum number of rating categories, or at 
least a number of rating categories beyond which there is no further 
improvement in discrimination of the rated items. Garner and Hake 
(1951), Guilford (1954), and Komorita and Graham (1965) indicated
that if we use too few rating categories, our scale is obviously
a coarse one, and we lose much of the discriminative powers of which
the raters are capable. Conversely, we could also grade a scale so 
finely that it is beyond the rater’s limited powers of discrimination. 

Ghiselli (1948) and Guilford (1954) contended that the optimal 
number of steps is a matter for empirical determination in any 
situation, and suggested that there is a wide range of variation in 
refinement around the optimal point in which reliability changes 
very little. Guilford felt that it may be advisable in some favorable
situations to use up to 25 scale divisions. Ghiselli suggested that
either reliability of measurement or ease of rating be used as a basis 
for the empirical determination of the optimal number of steps. 
The optimal number of rating categories, or at least the number
beyond which there will be no further improvement in discrimination,
is, according to Garner (1960), clearly a function of the amount of
discriminability inherent in the items being rated. He suggested
that there can be no single number of categories appropriate for all
rating situations.

Champney and Marshall (1939) reported that under favorable 
rating conditions the practice of limiting rating scales to five or 
seven points may often give inexcusably inaccurate results. They 
suggested that the optimal number of steps is a function of the 
conditions of measurement. They also considered that, unless it 
could be shown that for a particular task either accuracy is not 
desirable or discrimination beyond seven points cannot be attained,
it may be appropriate to use 18- to 24-step rating scales. Both 
Champney and Marshall and Guilford suggested that when the rater 
is trained and interested, the optimal number of steps may be in 
the 20-point range. The literature, however, contains few descrip- 
tions of scales employing such large numbers of rating categories. 

Jahoda, Deutsch, and Cook (1951) and Ferguson (1941) opined 
that the reliability of a scale increases, within limits, as the number 
of possible alternative responses is increased. Cronbach (1950) sug- 
gested that there is no merit to increasing the reliability of an 
instrument unless its validity is also increased at least
proportionately. He concluded that “it is an open question whether a
finer scale of judgment gives either a more valid ranking of subjects
according to belief, or scores more saturated with valid variance 
(p. 22)." Earlier, Symonds (1924), in contrast to Cronbach, con- 
tended that the problem of determining the number of steps to 
utilize is primarily one of reliability. He implied that optimal 
reliability is obtained with a 7-point scale. If more than seven steps 
are utilized, increases in reliability would be so small that it would 
not pay for the extra effort involved. However, if the raters are un- 
trained or relatively disinterested, maximal reliability will be 
reached with fewer steps. Champney and Marshall (1939) suggested [...].

Results of empirical investigations by Bendig (1954) and Komorita
(1963) indicated that reliability is independent of the number of
scale points employed. Komorita concluded that utilization of a
dichotomous scale would not significantly decrease the reliability
of the information obtained when compared to that obtained from
a multi-step scale.

Whether an increase in the number of scale points is associated 
with an increase in reliability, and how many scale points should be 
employed beyond which there would be no further meaningful increase
in reliability, are both empirical questions. Studies addressed
to these reliability questions have typically employed a measure of 
internal consistency (either split-half stepped up by the Spearman- 
Brown Prophecy Formula or Kuder-Richardson Formula 20). Utili- 
zation of a stability (test-retest) measure appears to be nonexistent. 
It should be apparent that both reliability coefficients—internal 
consistency and stability—must be assessed if meaningful and 
complete answers to the questions posed are to be provided. 

Moreover, studies dealing with the number of alternatives prob- 
lem emphasize reliability as the major, and in some instances, only 
criterion in the choice of the number of scale points. However, accord- 
ing to both Cronbach (1950) and Komorita and Graham (1965), 
the ultimate criterion is the effect a change in the number of 
scale points has on the validity of the scale. An intensive literature 
search failed to reveal any empirical investigation addressed to 
this question. 

Multi-step Likert-type rating scales provide two components of 
information—the direction and the intensity of an individual's 
attitudinal composition. Peabody (1962) concluded that the total
scores obtained with any Likert-type scale represent primarily the
directional component, and only to a minor degree the intensity
component. Both Peabody (1962) and Cronbach (1950) suggested
that differences in the intensity component primarily represent
differences in response set tendencies, i.e., tendencies for subjects
to use a particular degree of agreement or disagreement toward any
attitudinal object regardless of the direction. Cronbach concluded
that any increase in test reliability due to response set, in the final 
analysis, dilutes the test results and lowers its validity. 

This investigation was undertaken to answer a fundamental and
deceptively simple question: is there an optimal number of
alternatives to use in the construction of a Likert-type scale? Of
specific concern was whether variations in the number of scale
alternatives affected either reliability or validity.


Method

Subjects 

Four hundred and ten undergraduate psychology students enrolled
in a large midwestern university participated in this experiment.
The procedure first involved selecting adjective statements for
each scale point (n = 40), then determining the inter-rater
reliability on those statements selected (n = 10), and, lastly,
conducting the experiment proper (n = 360) in which 20 subjects were
assigned to each of the 18 different Likert scale formats. Different
samples of students attending classes in general introductory
psychology, introductory applied psychology, industrial psychology,
and consumer psychology were used for each segment of the study.


Scale Construction and Instruments 


[...] “I infinitely agree,” in paired comparison format, and asked to
select the statement from each pair “which indicated greater agreement.”
A total of 136 comparisons were made by each subject. (There is
no reason to believe that the results would have been any different
had the instructions specified disagreement rather than agreement.)
The information derived from this procedure served as the basis for
selecting those statements used to construct the 18 (i.e., 2- to
19-point) Likert-type rating formats. Criteria for the selection of
a statement were: (a) that it have a minimal number of reversals
(less than five out of a possible 40), and (b) that it be
approximately equidistant (where possible) from the statements
preceding and following it. Ten of the original 17 statements came
closest to meeting those criteria. These 10 statements were then
presented to a group of 10 subjects who were instructed to rank them
in the order of increasing disagreement. (The purpose of the
disagreement instructions with these subjects, in contrast to the
agreement instructions given the earlier 40 subjects, was simply to
insure that the relative intensity of the descriptive adjectives
remained invariant, i.e., was unaffected by the direction of the
statement.) An average rank-order correlation coefficient was then
computed to determine the inter-rater reliability.
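
The two quantities in this paragraph can be illustrated briefly. The sketch below counts the pairings of 17 statements and averages Spearman rank-order correlations over all pairs of judges. The rankings are hypothetical, the function names are illustrative, and averaging over judge pairs is an assumption, since the text does not say how the average was formed.

```python
# Sketch: number of paired comparisons and an average rank-order correlation.
# Rankings and function names are hypothetical; averaging over judge pairs is assumed.
from itertools import combinations
from math import comb


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank-order correlation for two untied rankings."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def average_rank_correlation(rankings):
    pairs = list(combinations(rankings, 2))
    return sum(spearman_rho(a, b) for a, b in pairs) / len(pairs)


n_comparisons = comb(17, 2)  # every statement paired once with every other

# Hypothetical rankings of 10 statements by three judges.
judges = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 10, 9],
    [2, 1, 3, 4, 5, 6, 7, 8, 9, 10],
]
avg_rho = average_rank_correlation(judges)
print(n_comparisons, round(avg_rho, 3))
```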

The instrument used in the experiment proper was a modified 
Allport-Vernon-Lindzey Scale of Values (1960), containing 60 
items. Eighteen different versions, in which the number of alterna- 
tives for each item ranged from a 2-point to a 19-point format, 
were constructed, using the ten adjective descriptors obtained in the 
first part of the study. The criterion for the construction of each 
format was that each scale point be approximately equidistant from 
the ones preceding and following it. 


Procedure 


The experimenter entered the testing room and proceeded to dis- 
tribute the rating booklets. Arrangement of the booklets was in such 
an order that the first subject received a 2-point rating scale, the 
second received a 3-point rating scale, and so on, until the eigh- 
teenth subject received a 19-point rating scale booklet. This pro- 
cedure was repeated until all subjects had obtained rating booklets. 
For test-retest purposes, subjects were asked to record their names, 
course name and number, time and place of meeting, and instructor's 
name on top of their rating booklets. The subjects were then in- 
structed to open their booklets, read the instructions, record the time, 
rate the 60 statements, and then record the time at completion of the 
task. The rating instructions were the same for all the booklets, 
except that every block of 20 subjects used a different scale to rate 
the statements. Subjects did not know they were using different
rating scales.

After completing the modified Study of Values, the subjects pro- 
ceeded to fill out an attached criterion measure. Statements in the 
criterion measure explicitly spelled out what each subscale on the 
Study of Values was designed to measure, as defined by its test 
manual. Using a graphic rating scale, each subject was asked to rate 
the present importance of each of the six value areas in his life. 

Three weeks after the first administration, and with the assis- 
tance of the identification data provided at the first session, each 
subject was contacted and received another rating booklet identical 
to the first. Upon completion, the purpose of the experiment was 
explained and questions were answered. 

Data obtained from the premeasure were analyzed to determine
the internal consistency reliability (Cronbach's alpha, 1951) and
concurrent validity. Both measures, pre- and post, were used to
assess the test-retest reliability, predictive validity, and the
reliability of the criterion measure for attenuation-correction
purposes.
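
A correction for criterion attenuation is conventionally made by dividing the observed validity coefficient by the square root of the criterion's reliability. The sketch below shows that standard formula; the authors do not spell out their computation, so treating it as their method is an assumption, and the input values are hypothetical.

```python
# Sketch: standard correction of a validity coefficient for unreliability
# in the criterion only. Input values are hypothetical.
import math


def correct_for_criterion_attenuation(r_xy, criterion_reliability):
    """Return r_xy / sqrt(r_yy)."""
    return r_xy / math.sqrt(criterion_reliability)


corrected = correct_for_criterion_attenuation(0.40, 0.64)
print(round(corrected, 2))
```

The corrected coefficient is never smaller than the observed one, which matches the pattern of paired uncorrected and corrected columns in the validity tables.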

A Fisher Z transformation (Fisher, 1921) was undertaken to con- 
vert all reliability and validity coefficients in order to insure nor- 
mality. These transformations were then analyzed by a single 
classification analysis of variance procedure to determine whether 
there were significant differences in reliability and validity as a 
function of rating format. Each of these analyses was segmented 
by the six value areas in the modified Study of Values. 
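
The Fisher Z transformation is the inverse hyperbolic tangent of the correlation. A minimal sketch using only the standard library follows; the helper names and the example of averaging in the Z metric are illustrative assumptions.

```python
# Sketch: Fisher Z transformation and its inverse.
import math


def fisher_z(r):
    """z = 0.5 * ln((1 + r) / (1 - r)), i.e. atanh(r)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    return math.tanh(z)


def mean_correlation(correlations):
    """Average correlations in the Z metric, then transform back."""
    mean_z = sum(fisher_z(r) for r in correlations) / len(correlations)
    return inverse_fisher_z(mean_z)


z_82 = fisher_z(0.82)
print(round(z_82, 4), round(mean_correlation([0.60, 0.80]), 3))
```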

Following data collection, the responses to each item of the modi- 
fied Study of Values were converted to dichotomized or trichoto- 
mous measures. All even-numbered formats were dichotomized at 
the center. Responses to the left of center were scored “agree,” while 
those to the right were designated “disagree.” The odd-numbered 
formats were trichotomized, yielding the categories of “agree,” “un- 
certain,” and “disagree.” The resultant reliability and validity 
coefficients were then determined for each original and collapsed 
rating format and subsequently transformed into Fisher Z's. The 
standard error of the difference between the original and collapsed 
set of Z's was computed and then divided into the difference between 
the original and reduced Z coefficients. This procedure, a critical 
ratio, allowed us to determine whether the original correlations were 
significantly different from those obtained by collapsing these 
many-stepped formats to dichotomous or trichotomous measures. 
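
The collapsing rule and the critical ratio can be sketched as follows. Several points are assumptions rather than the authors' stated procedure: responses are coded 1 to k from the agree (left) end, the middle category of an odd format is scored "uncertain", and the standard error is the usual independent-samples value, sqrt(1/(n1 - 3) + 1/(n2 - 3)). The sample sizes in the example are hypothetical.

```python
# Sketch: collapsing a k-point response and testing two correlations
# through their Fisher Z values. Coding and SE formula are assumptions.
import math


def collapse_response(response, n_points):
    """Collapse a 1..n_points response (1 = leftmost, agree end)."""
    if n_points % 2 == 0:
        # Even formats are dichotomized at the center.
        return "agree" if response <= n_points // 2 else "disagree"
    middle = (n_points + 1) // 2
    if response < middle:
        return "agree"
    if response > middle:
        return "disagree"
    return "uncertain"


def critical_ratio(r_original, r_collapsed, n_original, n_collapsed):
    """Difference between Fisher Z values divided by its standard error."""
    standard_error = math.sqrt(1 / (n_original - 3) + 1 / (n_collapsed - 3))
    return (math.atanh(r_original) - math.atanh(r_collapsed)) / standard_error


cr = critical_ratio(0.82, 0.78, 360, 360)  # hypothetical sample sizes
print(collapse_response(4, 7), round(cr, 2))
```

A ratio below 1.96 in absolute value would be nonsignificant at the .05 level under a two-tailed normal test.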


Results 


Table 1 summarizes the results of the adjective selection procedure
and presents for each statement the proportion of “greater
agreement” judgments made by the subjects. Employing the criteria
of minimal reversals and approximate equidistance from preceding
and succeeding statements, the 10 statements finally selected,
together with their scale value, are graphically presented in
Figure 1. To ascertain the consistency (inter-rater reliability)
with which these statements were ranked, 10 additional subjects
proceeded to rank them. An average rank-order correlation
coefficient of .99 was obtained, indicating an extremely high degree
of agreement among raters as to the rank associated with each
statement.

TABLE 1
Scale Values of the Intensity Ratings for the Original Set of Statements

                                        Proportion of “greater
Statement                               agreement”
I am uncertain                          .00
I am uncertain, but probably agree      .08
I hardly agree                          .17
I scarcely agree                        .20
I minutely agree                        .22
I vaguely agree                         .29
I barely agree                          .30
I slightly agree                        .41
I moderately agree                      .45
I pretty much agree                     .53
I strongly agree                        .63
I intensely agree                       .74
I immensely agree                       .76
I extremely agree                       .76
I absolutely agree                      .92
I infinitely agree                      .94
I unlimitedly agree                     .94

Tables 2 through 5 present the internal-consistency reliability,
test-retest reliability, concurrent validity, and predictive validity
coefficients (the latter two corrected for criterion attenuation) for
each of the 18 rating formats hexacotomized by each of the
Allport-Vernon-Lindzey value areas. Table 6 presents the results of
analyses of variance computed for each value area to assess the
extent to which there was a relationship between the rating formats
and the reliability and validity measures. This table displays the F
ratio for each criterion and value area, indicating whether the
relationship found was significant and, if so, to what extent.
Examination of Tables 2 through 6, as well as visual inspection of
graphs charted from the data contained in these tables, reveals that
there is no systematic relationship between predictive validity,
concurrent validity, internal-consistency reliability, and
test-retest reliability and the number of steps in a Likert-type
rating scale. This lack of a systematic relationship was replicated
for each of the six value areas in the modified
Allport-Vernon-Lindzey Study of Values.

[Figure 1 point labels: Uncertain; Uncertain, but probably agree; Hardly agree; Barely agree; Moderately agree; Pretty much agree; Strongly agree; Intensely agree; Infinitely agree. Axis: Proportion of "greater agreement" judgments.]

Figure 1. Graphic representation of the scale values of the selected statements.

Table 7 presents the reliability and validity vectors for the 18
original and collapsed rating formats. Figures 2, 3, and 4 graphically
display the test-retest reliability, concurrent validity, and predictive
validity coefficients for the original and reduced rating formats,
respectively. It is apparent that a large degree of overlap exists
within each of the three pairs of curves. There appear to be only
minimal differences between the reliability and validity vectors based
upon the original rating formats and those obtained by collapsing these
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TABLE 3 
Reliability Coefficients for Each Rating Format Hexacotomized by Value Area

Format   Theoretical   Political   Economic   Aesthetic   Religious   Social

[Table entries illegible in the source scan.]

TABLE 4
Concurrent Validity Coefficients for Each Rating Format Hexacotomized by Value Area

Concurrent Validity Coefficients
Format   Theoretical   Political   Economic   Aesthetic   Religious   Social

E 10 .16* .01 .02* .03 .04* .08 .09* .11 .M* 43 .62* 
> 103 .05  .70 .89 451  .28 .35 62.67 .46 .58 
4 197 .40 э .20 47 .53 .32 .51  .63 6 .48 .58 
5 105 .13  .07 .08  .37 .39 45 .54 .86 .87  .52 .00 
6 44 .50 08 .00 -62 .66  .67 .82 | 00 .78 .19 .26 
MS C40 .59 103: 108.) 14 :1897568/076 09 A LO CD 1.34 
8 43 .52 87 .60 .65 .75 .40 .43 .78 .81 .50 -58 
9 127 б о 105 72 76  .98 .44  .50 .07 .26 .28 
10 i36 .46  .01 оз  .13 .15 41 .48  .55 .00 .6@ .90 
п 1226 .36 130 .43 72 .88. 19.35-64.76 1-33, .4l 
n Tia 0% :06  .41 .48 9 .0/ | 31.80 кое 
Los 3» lun de re.32 4207 0870 0:00 80000844" -53 
КА 120 24.04 .06 ла BL 25506750 08 206.0 AB. -18 
003 ‘34 .45 30 .35 146 .58  .41 .51  .78 .88 — .45 .49 
16 — 98 /49 00 .01 69 .91 .30 .41  .51 .55  .7& BS 
POUR. 03...54 то. эз, 51480 1:40 036 O 132. -27 
18 :20 88. .30 .49 , 04.04 42 1-09. O «6T 
19 51 160 ооо 0 .07 64 .31  .24 .90 <6 .86 


* The asterisked columns have been corrected for criterion attenuation.


666 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


TABLE 5
Predictive Validity Coefficients for Each Rating Format Hexacotomized by Value Area

Format   Theoretical   Political   Economic   Aesthetic   Religious   Social


2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 


о ш ШОШ OI 711 199° .00 .06* .50 
10 .13 .54 .08  .55 .62 49 .62  .49 .54 07 
23 .35 .4 .65 .48 .54 33 .51  .01 <64 61 
29 .72  .04 .05  .45 .48 66 .07  .85 .80 15 
39 .44  .10 .11 .55 .59 ОИ 25 .83 11 
401. .02. 08... 06%, 87.40 200.77/42..70. .94 07 
A Б.Б НО .69 п 55 .59 .88 .90 56 
.05 .09 .04 .06  .81 .84 44 .51  .58 .06 20 
1:55 202-025) 9348 БӨ 82 87. 63 70 — .18 
«43. .59. 81 64 .43 341) .10 .19 46 .55  .8l 
-52 .060 07 .08  .40 .46 42 .54 .43 .50 24 
AL 250. $125 59889 14. .80 4l .46 .61 .66 4l 
.86 a @ ла .24 .27 49 .07  .57 .61 37 
.22 .29. .80 АЮ .04 .77 34 .43 .61 .69 43 
46 .66 .03 .06  .64 .85 52 .70 .63 .68 65 
-78 .90 01 .01 .06 .09 28 .33 SL .33 30 
.24 .84 18 .29 .03 .04 86 .54 .44 .55 39 
.54 .63 88 .60 .16 .22 60 .07 .31 .38 31 


* Corrected for attenuation. 


formats to dichotomous and trichotomous measures. Three critical 
ratios, computed to determine whether these validity and reliability 
vectors differed, resulted in nonsignificance (Table 8), demonstrat- 
ing that, regardless of the number of steps originally employed to 
collect the data, conversion to dichotomous or trichotomous measures 
does not result in any significant decrement in reliability or validity. 
Therefore, provided that an adequate number of items are contained 
on the inventory, increasing the precision of measurement does not 
eventuate in greater reliability or validity vectors. 


Discussion and Conclusions 


The evidence from the present study led us to conclude that both
reliability and validity are independent of the number of scale points
used for Likert-type items. Both internal consistency and stability
measures were obtained. The average internal consistency reliability
across all areas was .66, while the average test-retest reliability
was .82. Both reliability measures, test-retest and internal
consistency, were found to be independent of the number of scale
points. This finding is consistent with those reported by Bendig
(1954), Komorita (1963), Komorita and Graham (1965), and Peabody
(1962), contrasts with findings by Symonds (1924) and
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TABLE 7 
Reliability and Validity Coefficients for the Original and Reduced Rating Formats

          Test-Retest Reliability    Concurrent Validity      Predictive Validity
Rating    Original    Collapsed      Original    Collapsed    Original    Collapsed
Format    Format      Format         Format      Format       Format      Format
2 .99 .99 .43 .43 :51 TF 
3 40 .70 AT AT .07 0 
4 .86 83 49 55 .62 | 
5 .83 .82 „52 41 .15 
6 .88 .80 .19 .28 .12 
7 .80 4 .20 .20 ‚08 
8 .88 ‚54 .51 .03 .56 
9 .82 .78 .26 .42 .21 
10 72 .82 .68 47 .19 .05 
11 .85 .82 .34 47 .32 ‚51 
12 .92 .88 -62 ‚64 .24 E 
13 77 .66 ET .16 .42 AL 
14 .68 .67 .15 .20 .38 SM 
15 .70 .65 .45 .40 44 3 i 
16 .82 т 74 .67 .66 AL 
17 .82 .80 T5 .04 .30 38 | 
18 .75 .62 .52 .36 .39 2 
19 .65 К .66 75 81 AB 
Note.—All values are based upon the social scale. 
Figure 2. The test-retest reliability coefficients for the original and collapsed rating formats.

Figure 3. The concurrent validity coefficients for the original and collapsed rating formats.

Figure 4. The predictive validity coefficients for the original and collapsed rating formats.
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TABLE 8 


Summary Table of the Tests of Significance on the Reliability and Validity
Coefficients for the Original and Collapsed Rating Formats

                            Original      Collapsed    Critical
Criterion                   Format        Format       Ratio         P

Test-Retest Reliability     .82           .78          1.47          NS
Concurrent Validity         .45           .40          .80           NS
Predictive Validity         [illegible]   .33          [illegible]   NS


Champney and Marshall (1939), and contests opinions proffered
by Jahoda, Deutsch and Cook (1951) and Ferguson (1941) to the
effect that the reliability of a scale will increase as the number of
scale points increases. Based upon the evidence adduced thus far, it
would seem that reliability should not be a factor considered in
determining Likert-scale rating format, as it is independent of the
number of scale steps employed.

Cronbach (1950) claimed that if utilization of finer rating scales, 
as opposed to coarser ones, increases the reliability of measurement, 
then this increase should be attributed to the addition of response 
set variance and not to any increase in the refinement of measure- 
ment. Response set was not found to be a factor affecting the inten- 
sity component; finer rating scales did not yield an increase in the 
reliability of measurement over coarser ones. The extent to which 
response set bias influenced the directional component of the re- 
sponses is unknown. 

With respect to Cronbach’s (1950) and Komorita and Graham’s (1965)
contention that validity should be the ultimate criterion, as far as
the authors can determine, this study is the first to attempt to
assess the relationship between validity and number of scale points.
As with reliability, validity was found to be independent of the
number of scale points contained in the rating scale. This finding
remains even after correcting the predictive and concurrent validity
coefficients for criterion attenuation. Moreover, the same results
were obtained for each of the areas on the modified Study of Values.
We can conclude, therefore, that when considering the number of steps
to employ in a Likert scale rating format, validity need not be
considered because there is no consistent relationship between it and
the number of scale steps utilized.

In an attitude survey there usually is no manifest criterion present
since behavior is not necessarily a function of attitudes. The choice,
in this study, of whether to use either an internal measure (i.e.,
correlation of each item with total score, less that item) or an
external measure was made in favor of the latter. The internal measure
has no intrinsic relationship to external reality, while the external
criterion does. Directing attention to the obtained validity vectors,
we note that, while not consistently high or low, in most cases
they compare quite favorably with the bulk of those reported in
the literature. Ghiselli (1955), in a comprehensive review of both
published and unpublished studies, found that the range of average
validities for psychological predictors was in the .30's and
low .40’s. An average of .50 was a distinct rarity. The average
concurrent validity coefficient (corrected for attenuation) in the
current study, across all formats and value areas, was .53. The
average predictive validity (again corrected for attenuation) was
.51.²

Komorita and Graham (1965), in discussing studies by Komo- 
rita (1963) and by Bendig (1954), stated that “if it is a valid 
generalization (i.e., independence of reliability and number of 
scale steps), the major implication is that, because of simplicity 
and convenience in administration and scoring, all inventories 
and scales ought to use a dichotomous, 2-point scoring scheme 
(p. 989).” Peabody’s (1962) results indicated that composite scores, 
consisting of the sum of scores on bipolar, 6-point scales, mainly 
reflect direction of response and are only minimally influenced by
intensity of response. He concluded from this that there is justifica- 
tion for scoring bipolar items dichotomously according to direction 
of response. This investigation provided empirical evidence in sup- 
port of these assumptions. 

The lack of any significant differences in reliability and validity
stemming from the utilization of a particular format, or from
collapsing a many-stepped format into a dichotomous or trichotomous
measure, led to the conclusion that total scores obtained
with Likert-type scales, as both Peabody and Cronbach have
suggested, represent primarily the directional component and only
to a minor degree the intensity component. Therefore, of the
three components contained in a Likert-type composite scale
score (direction, intensity, and error), the directional component
accounts for the overwhelming majority of the variance.

² Concurrent and predictive validity vectors, uncorrected for criterion
attenuation, were .42 and .40, respectively.

It has been demonstrated that regardless of the number of 
steps originally employed to collect the data, conversion of these 
many-stepped response scales to dichotomous or trichotomous 
measures does not result in any significant decrement in reliabil- 
ity or validity. Therefore, increasing the precision of measure- 
ment does not eventuate in greater reliability or validity vectors, 
provided that an adequate number of items are contained in the 
inventory. 

One ramification of this finding, if substantiated, would be 
greater flexibility in the adoption of a given format for a given 
predictor, criterion, and subject. Since there appears to be in- 
dependence between reliability and validity vectors and rating 
format, desirable practical consequences might be obtained from 
allowing the subject to select the rating format which best suits 
his needs. This might result in the highly favorable consequence 
of increasing the subject’s motivation to complete the scale. Con- 
versely, if the respondent is not satisfied with a particular rating 
format, regardless of the reason, the possibility exists that de-
leterious effects might result from the unsatisfactory rating for-
mat-respondent interaction. This interaction could eventuate in
a decrement in interest and/or reduced motivation to continue 
the rating procedure or to complete any remaining parts of the 
measurement process. 


A final consideration is the comparison of such data with data
that were previously collected with different rating formats. To
solve this problem, previously collected data could be collapsed
into dichotomous or trichotomous measures. This reduction in
the precision of measurement, as demonstrated in this research,
does not lead to any deleterious effects vis-a-vis reliability or
validity. The resultant response distributions, originally based
upon different rating formats, could then be directly compared
since they would now all be projected from the same base measure.

A basic question appears to be whether the utilization of fine
rating scales increases the refinement of measurement over that
which is obtained with coarse dichotomous or trichotomous scales.
The overwhelming consistency of results of this study, in addition
to those obtained by Peabody (1962), Komorita and Graham
(1965), and Bendig (1954), strongly suggests a negative answer to
this question. 

The primary practical implication of this study is that in-
vestigators would be justified in scoring attitude items dichoto-
mously (or trichotomously), according to direction of response,
after they have been collected with an instrument that provides
for the measurement of the intensity component along with the 
directional component. 

Further research should now be conducted to determine whether
the present findings can be generalized beyond the Likert-type
scale to different types of scales (e.g., Osgood's Semantic Differ-
ential, Thurstone-type scales, graphic rating scales, etc.) and for
other purposes (e.g., the rating of behavior, personality, indus-
trial work performance, etc.). It should also be determined whether
the conclusions are generalizable to different subject populations
defined by such parameters as level of education or ability, and
by psychological, experiential, demographic, and ecological char-
acteristics.
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VALIDATION BY THE MULTIGROUP-MULTISCALE 
MATRIX: AN ADAPTATION OF CAMPBELL AND 
FISKE'S CONVERGENT AND DISCRIMINANT
VALIDATIONAL PROCEDURE 


JOHN A. CENTRA 


IN a well-known paper, Campbell and Fiske (1959) advocate 
an approach to investigating validity employing a matrix of inter-
correlations among tests representing more than one trait, each 
measured by more than one method. Independent measures of the 
same trait, they state, should correlate higher with each other 
than they do with measures of different traits involving the 
separate methods. In addition these validity values should be 
higher than the correlations among different traits which hap- 
pen to employ the same method. Campbell and Fiske refer to the 
process as convergent (confirmation by independent measurement 
procedures) and discriminant (distinction of one trait from 
another) validation by the multitrait-multimethod matrix. 

The purpose of this paper is to present an application of the 
multitrait-multimethod procedure in a different context. While 
the procedure typically compares independent methods of mea- 
surement, the application proposed here compares discrete groups 
of individuals; and rather than different individual traits, this 
adaptation compares scale scores representing group responses to 
multidimensional perceptual space. If similar stimulus properties 
rather than particular subjective interpretations or group biases 
are being measured, then agreement between the discrete groups in 
the way they respond to each scale would be expected. As used
here, therefore, intergroup agreement on a set of perceptual scales 
is being assessed, and validity may be defined as the degree of 
similarity in responses by different constituent groups. The adap- 
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tation might be referred to as validation by the multigroup- 
multiscale matrix. 

The adaptation will be illustrated in this paper by analyzing 
group perceptions of their college environment. Multidimensional 
scales assessing the viewpoints of faculty, administrator, and 
student groups comprise the multigroup-multiscale matrix. 


Method 


The environmental assessment instrument used in this study 
was the Institutional Functioning Inventory (IFI) (Peterson, 
Centra, Hartnett, and Linn, 1969). Constructed to measure the 
conditions and emphases at individual colleges and universities, 
respondents report on what their college is like—what activities 
go on, how people typically behave—rather than their own be- 
havior or attitudes. The IFI consists of 11 scales, each comprised 
of 12 items, to which faculty members and administrators on 
each campus respond; students are judged to be in a position to 
respond to only the first six scales since they are assumed to be 
less able to give meaningful responses to the particular items in- 
cluded in the last five scales. 

Titles and brief definitions of the IFI scales are given below. 
More complete scale descriptions as well as information on the 
development of the IFI may be found in the manual (Peterson 
et al., 1969). 

1. Intellectual-Aesthetic Extracurriculum (IAE) refers to the
availability of activities for intellectual and aesthetic 
stimulation outside the classroom. 

2. Freedom (F) has to do with academic and personal free- 
dom for faculty and students. Low scores suggest an in- 
stitution that places many restraints on all individuals in 
the campus community. 

3. Human Diversity (HD) has to do with the degree to which 
the faculty and student body are heterogeneous in their 
backgrounds and present attitudes. 

4. Concern for Improvement of Society (IS) refers to a desire 
among people at the institution to apply their knowledge
and skills in solving social problems and prompting social 
change in America. 

5. Concern for Undergraduate Learning (UL) has to do with 
the degree to which the faculty and administration
emphasize undergraduate teaching and learning.

6. Democratic Governance (DG) has to do with the extent 
to which individuals in the campus community who are
directly affected by a decision have the opportunity to par-
ticipate in making the decision.

7. Meeting Local Needs (MLN) refers to an institutional 
emphasis on providing educational and cultural opportun- 
ities for adults in the surrounding area. High scores indicate 
availability of adult education, job-related and remedial 
curricula. 

8. Self-Study and Planning (SP) has to do with the impor- 
tance college leaders attach to continuous long-range plan- 
ning for the total institution. 

9. Concern for Advancing Knowledge (AK) has to do with 
the degree to which the institution emphasizes research and 
scholarship. 

10. Concern for Innovation (CI) refers to an institutionalized 
commitment to experimentation with new ideas for educa- 
tional practice. 

11. Institutional Esprit (IE) refers to a sense of shared pur-
poses and high morale among faculty and administrators. 

Representative samples of faculty and administrators at 22
diversified colleges and universities responded to the IFI during
1968. At 17 of these institutions, a student sample also completed 
the first six scales of the inventory. The multigroup-multiscale 
analyses presented in this paper were based on the above sample 
of institutions. One matrix consists of three groups with six scales
(N = 17) and the second matrix consists of two groups (faculty
and administrators) with all 11 IFI scales (N = 22). Group means,
compiled with the 12 items in each scale, were intercorrelated. 
Although the number of institutions was small, it represented 
several types and sizes of four-year colleges and universities. A 
larger sample would have been desirable but for the purposes of 
demonstrating the application presented here, the current sample 
would appear sufficient. 

Internal consistency reliabilities for each scale (coefficient alphas,
Cronbach, 1951) were uniformly high (with one exception, .86
and higher). These reliabilities, computed for each scale and for 
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each of the three groups, provided an estimate of reliability based 
on the mean correlation among items within each scale. 
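
As an illustration of the reliability estimate just described, the following is a minimal sketch of coefficient alpha computed from item-level scores. The function name and the toy data are hypothetical and are not taken from the IFI; the formula is the standard one, in which alpha rises as the summed item variances become small relative to the variance of the total score.

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Coefficient alpha for one scale.

    item_scores: list of respondents, each a list of k item scores.
    """
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the total (summed) scale score.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: three respondents, three perfectly consistent items.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(coefficient_alpha(toy_scores))
```

With perfectly consistent items, as in the toy data, alpha reaches its upper limit of 1.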


Results 


The intercorrelation matrix of faculty, administrator, and student
responses to the first six IFI scales is presented in Table 1. The
italicized entries in the diagonals, known as the validity values, are
particularly important. In the first column, for example, the validity
value of .98 (italicized) is the correlation between faculty and
administrator responses to the IAE scale, while in the same column
.94 is the correlation between student and administrator responses.
The nonitalicized values in the three rectangles represent the cor-
relations of different groups responding to different scales, and
within the three triangles are the standard intercorrelations of scales
within each group.

According to Campbell and Fiske (1959, pp. 82-83) four aspects 
of Table 1 bear on the question of validity. First, they recommend 
that the italicized entries (validity diagonal) be “significantly dif-
ferent from zero and sufficiently large to encourage further examina-
tion of validity.” With the exception of the DG scale, which has a
correlation of .20 between administrator and student responses and
a correlation of .30 between faculty and students, this require-
ment is met (evidence of convergent validity through intergroup
agreement).

Second, each validity diagonal value should be higher than the 
values lying in the column and row of its rectangles; i.e. there 
should be more agreement among different respondents on the 
same scale than on different scales. Thus for the IAE scale the 
correlations of .98 (faculty and administrators), .94 (students 
and administrators), and .95 (students and faculty) are each 
higher than correlations for IAE and any other scale (within 
each of the three rectangles). On the other hand this is not true 
for the DG scale and in two instances for the F scale. That is, F
responses by students correlate .80 with administrator HD re-
sponses, which is higher than the .77 correlation between student
and administrator F responses; and student F responses correlate
.85 with administrator HD responses, which is higher than the
.81 correlation between faculty and students on Freedom.
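
The second criterion can be checked mechanically. The sketch below is one way to do so for a single group-by-group rectangle, assuming the rectangle is stored as a square list of lists with scales in the same order on rows and columns. The function name is hypothetical, and only the .77 and .80 values come from the text; the remaining entries are invented for illustration.

```python
def fails_discriminant_check(block, scales):
    """Return the scales whose validity diagonal value is not the largest
    entry in its own row and column of one group-by-group rectangle."""
    failures = []
    n = len(scales)
    for i in range(n):
        diagonal = block[i][i]
        # Off-diagonal entries sharing a row or a column with the diagonal.
        others = [block[i][j] for j in range(n) if j != i]
        others += [block[j][i] for j in range(n) if j != i]
        if any(r >= diagonal for r in others):
            failures.append(scales[i])
    return failures

scales = ["F", "HD"]
# Rows: student responses; columns: administrator responses.
# .77 and .80 are reported in the text; the second row is hypothetical.
block = [[0.77, 0.80],
         [0.60, 0.85]]
print(fails_discriminant_check(block, scales))  # ['F']
```

The F scale is flagged because the student F responses correlate more highly with administrator HD responses (.80) than with administrator F responses (.77).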


Third, Campbell and Fiske propose that each validity diagonal
be higher than any of the intercorrelations involving that scale
within each group (the values in the triangles). Again, as indicated
in Table 1, this requirement is met for each scale except DG and
F. For the Freedom scale, for example, faculty responses on F and
HD intercorrelate .84, which is higher than the .81 diagonal for
F when faculty and student responses are correlated.

The fourth and last aspect recommended by Campbell and
Fiske is that the same pattern or ranking of scale (trait)
interrelationships be evidenced, regardless of the levels of correlation
involved. Inspection of Table 1 suggests that this requirement is
fulfilled, with some exceptions on the DG scale.

The first criterion, according to Campbell and Fiske, provides
evidence for convergent validity, while the last three criteria are
evidence for discriminant validity. On the basis of the multigroup-
multiscale matrix, it would appear that the DG scale is somewhat
lacking in both convergent and discriminant validity. Specifi-
cally, while faculty and administrators agree fairly substantially
about the extent of Democratic Governance on their campus (.76),
students disagree with both adult groups. In addition, the DG
scale correlates highly with other scales within each group: 4
with IS for administrators, .70 with IS for faculty, and .80 with
UL for students.

The F and HD scales, according to the multigroup-multiscale
analysis, intercorrelate highly within each of the response groups
and thus are somewhat lacking in discriminant validity. In spite
of this, it would appear that these two scales are still valid for
faculty and administrators (both with values around .90 in the
validity diagonal), and only less so for students (evidence of
convergent validity). In fact, for students, only the F scale's validity
diagonal value (.81) failed to meet the Campbell and Fiske criteria.


Faculty-Administrator Response Comparisons 


Additional validity analyses by the multigroup-multiscale method
are presented in Table 2, in which faculty and administrator
responses at 22 institutions are compared. Among the last five
scales, the validity diagonal values of .96, .81, .96, and .64 for
MLN, SP, AK, and IE respectively are each higher than values
in their respective rows and columns. These scales, therefore,
appear to meet several of the validity criteria. For the Concern
for Innovation scale (CI), however, faculty and administrator
responses correlate .59. While this correlation is significantly
greater than zero, four scales in the CI row have higher correla-
tions. The CI scale, therefore, lacks both convergent and dis-
criminant validity according to the Campbell and Fiske criteria.
The remaining 10 scales, however, generally fulfill the four cri-
teria.

TABLE 2
Intercorrelations between Administrator and Faculty Mean Responses to the IFI Scales


Discussion


Interpretation of the multigroup-multiscale matrix differs con- 
siderably from that of the multitrait-multimethod matrix. Ac- 
cording to the latter, assessment methods converge (methodological 
triangulation) to confirm the measurement of a trait; accord- 
ing to the multigroup analysis, groups converge or agree in 
their assessment (perceptions) of stimulus properties (scales).
If, as in this study, discrete groups respond similarly to scales
meant to assess the environment they inhabit, then it becomes 
more reasonable to assume that the scales are measuring char- 
acteristics or conditions of that environment. If, on the other 
hand, there is lack of agreement between groups, environmental 
features (scales) are less objectively measured than they might 
be. In short, the scales more likely reflect subjective interpreta- 
tions of the environment rather than relatively objective en- 
vironmental constructs. Of course the logic of the multigroup 
approach assumes that each group is fully acquainted with the per- 
ceptual space (environment, in this study) represented by the 
scales. Otherwise lack of group agreement could simply indicate 
lack of knowledge by one group. 

In this study faculty and administrative groups agreed in their 
responses to 10 of the 11 environmental scales. For these 10 scales, 
therefore, similar institutional functions are being assessed re- 
gardless of whether the respondent group consists of faculty or 
administrators. The Concern for Innovation (CI) scale, on the 
other hand, which lacked convergent (agreement) validity, would 
appear to measure somewhat different institutional emphases,
depending on whether faculty or administrators were respond-
ing. Similarly the DG and F scales apparently assess different in-
stitutional functions when students respond than when faculty or
administrators respond. The existence of Democratic Governance 
and Freedom on campus, in other words, means something quite
different to students than to either adult group. 

It is important to note, however, that the multigroup-multiscale
adaptation does not confirm the theory underlying the measure-
ment. As with the multitrait-multimethod procedure, the existence
of a construct is established less by convergent validity than by
predicting correlations with other measures or by other strategies
of construct validity (APA, 1966; Cronbach and Meehl, 1955).
Convergent and discriminant validity information through the
multigroup-multiscale matrix does, nevertheless, appear to pro-
vide useful insights into how a current instrument is functioning
and might be improved. For example, in this study the Concern
for Innovation scale (CI) probably needs to be more clearly
delineated in view of its overlap with other scales and its rela-
tively low intergroup agreement. Changes made in the scale
would hopefully lead to improving its subsequent "validity co-
efficient"; in this manner, as Campbell and Fiske suggest, valida-
tion may be viewed as an ongoing program for improving meas-
urement devices.
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As samples increase in size, it is more and more likely that any 
given null hypothesis will be rejected at a statistically significant 
level. Many authors have discussed the implications of this fact 
(Baker, 1966; Lykken, 1968; Nunnally, 1960; Rozeboom, 1960) 
and have severely criticized the continuing use by psychologists of 
the statistical reject-accept hypothesis testing model. The essence 
of their criticisms is that when the model is used by psychologists, 
(e.g., by applying F or t statistics to examine mean differences),
large samples are likely to yield “significant” statistics even when 
the magnitude of mean differences is trivial. Thus, scientific con- 
clusions derived from such strategies may be erected on an accu- 
mulating array of triviality (Dunnette, 1966). The problem has
been stated forcefully by Hays (1965, p. 326): “Virtually any
study can be made to show significant results if one uses enough 
subjects, regardless of how nonsensical the content may be.” 

It is agreed by most such critics that psychologists should make
greater use of measures expressing the practical importance of dif- 
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ferences instead of relying entirely on the hypothesis testing mod- 
els. One such statistic (Tilton, 1937) expresses the degree of over- 
lap between two distributions. The overlap (O) between score 
distributions obtained by two groups is defined as the percentage of 
persons in one of the groups whose scores may be matched by per- 
sons in the second group. In the diagram below, the shaded area
designates the overlapping portion of two distributions. Two groups
which overlap completely will have identical scores and an O of
100 per cent. Two groups which are entirely separated will, of
course, have different means; but, more important, the O will be
0 per cent. Values of O between 0% and 100% give meaningful and
easily interpretable estimates of the practical importance of a
mean difference. Tilton's statistic is based on the ratio of the differ-
ence between the means of the two groups to the average of the
two standard deviations. Table 1 (from Tilton, 1937, p. 658)
shows the relevant information for estimating the value of O. The
table was developed by assuming that the two sample distribu-
tions of scores are random samples from normally distributed
populations with the same standard deviations, assumptions which 
Guion (1965) claims are rather rarely fulfilled when working in 
an applied situation. 
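
For readers without access to Tilton's table, the calculation can be sketched as follows. This sketch assumes that the tabled percentages correspond to the area shared by two normal curves of equal standard deviation, which is 2Φ(−d/2) when d is the mean difference divided by the average standard deviation. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tilton_overlap(mean1, mean2, sd1, sd2):
    """Percentage overlap O under the equal-variance normal assumption."""
    avg_sd = (sd1 + sd2) / 2.0
    index = abs(mean1 - mean2) / avg_sd  # ratio on which the statistic is based
    return 200.0 * normal_cdf(-index / 2.0)

print(tilton_overlap(50, 50, 10, 10))            # identical groups: 100.0
print(round(tilton_overlap(50, 70, 10, 10), 2))  # two SDs apart: 31.73
```

An index of zero yields an O of 100 per cent, and O falls toward zero as the mean difference grows relative to the average standard deviation.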

We agree with Guion that it is rare indeed for scores in two sam- 
ples to be perfectly distributed normally or to have identical stand- 
ard deviations. However, we must disagree with him about the 
presumed rarity of normality in population distributions; and, of 
course, the equivalence of the population standard deviations can
be estimated by computing confidence intervals based on the sam- 
ple statistics. 

Even so, it is important to examine the effects on the value of O
of non-normality in the population distributions and of departures 
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from equality in population variances. In other words, how robust 
is Tilton's overlap statistic? 


Method 


In order to investigate the robustness of O, we used a digital 
computer to generate pairs of populations of predetermined size 
and shape with specified means and variances. Fifty samples of 
specified size were then selected randomly from each pair of popu- 
lations. (After a pair of samples had been drawn and Tilton's O 
calculated, their “scores” were returned to the populations.) De- 
scriptive statistics were computed for the distribution of 50 sample 
Tilton overlap values. The mean and median sample values were 
then compared with the actual overlap between the two popula- 
tions. The actual population overlap was obtained simply by count- 
ing the number of scores in one population that could be matched 
by scores in the other population and converting this count to a
percentage of each population’s N. 
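
The counting procedure for the actual population overlap can be sketched as follows, assuming that a score in one population is "matched" when an unused identical score exists in the other, and that both populations have the same N. The function name and the toy populations are hypothetical.

```python
from collections import Counter

def count_overlap(pop1, pop2):
    """Percentage of scores in pop1 that can be matched one-for-one by
    identical scores in pop2 (both populations assumed to be the same size)."""
    counts1, counts2 = Counter(pop1), Counter(pop2)
    # At each score value, only the smaller of the two frequencies can be matched.
    matched = sum(min(n, counts2[score]) for score, n in counts1.items())
    return 100.0 * matched / len(pop1)

print(count_overlap([1, 2, 2, 3], [2, 3, 3, 4]))  # 50.0
```

In the toy example one score of 2 and one score of 3 can be matched, so two of the four scores overlap.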

All the populations developed consisted of 2000 whole numbers. 
Normal curves were generated by using a computer program which 
produced pseudo-random numbers. This generator “drew” num-
bers from a normal distribution having a mean of zero and a vari- 
ance of one. The random numbers were then transformed to the 
desired mean and variance by a simple linear transformation. Non- 
normal distributions were generated by using a computer program 
that allowed the user a great deal of freedom in establishing the 
shapes and central tendencies of the distributions to be used. The 
only two limitations in forming non-normal distributions were that 
the distribution could have no more than 100 unique score values 
and the total number of observations could be no larger than 2000. 
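
The generation of a normal population by linear transformation can be sketched as below. The original program is not available, so the function name, the default seed, and the use of rounding to obtain whole numbers are assumptions made for illustration.

```python
import random

def normal_population(mean, variance, size=2000, seed=0):
    """Generate `size` whole-number scores from a normal distribution by
    rescaling standard normal deviates: score = mean + sd * z."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    # Draw z from N(0, 1), apply the linear transformation, round to an integer.
    return [round(mean + sd * rng.gauss(0.0, 1.0)) for _ in range(size)]

population = normal_population(50, 100)
print(len(population), min(population), max(population))
```

Rounding adds a small amount of variance to the scores, which may be one reason the populations are described in the text as having variances "near" the intended values.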

In addition to calculating the mean and variance of each popu- 
lation distribution generated, measures of skewness and kurtosis 
were computed. The methods used to calculate the measures of
kurtosis and skewness were those given in McNemar (1962, pp.
25-28, and pp. 78-79). When the population's skewness index is
positive, its distribution is skewed to the right; a negative index
indicates a skewed-left distribution. When the population's
kurtosis index is less than zero, its distribution is somewhat flat-
topped; when the kurtosis index is greater than zero, the distribu-
tion is peaked with higher tails than those of a normal distribu-
tion. (Examples of distributions with their kurtosis or skewness
indexes can be found in McNemar, 1962, p. 27, and in Ghiselli,
1964, p. 58.) 
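
A minimal sketch of such indexes follows, using the common moment-based definitions (third moment over the 1.5 power of the second, and fourth moment over the squared second moment minus 3). Whether these match McNemar's formulas exactly is an assumption; they do agree with the sign conventions described above. The function name is hypothetical.

```python
def skewness_and_kurtosis(scores):
    """Moment-based skewness and kurtosis indexes for a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # zero for a normal distribution
    return skewness, kurtosis

# A flat (rectangular) set of scores: symmetric, with a negative kurtosis index.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
```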


Analyses and Results 


The analyses consisted of two phases. First, the effect of unequal 
population variances upon Tilton's overlap values was examined. 
Second, the impact of non-normal population distributions upon 
sample Tilton’s overlap values was studied. 


Unequal Population Variances 


The first analyses were intended to investigate the effects of un- 
equal population variances upon estimates of population overlap 
calculated using Tilton’s method with sample data. For each 
analysis, a pair of normal populations was generated with a de- 
sired mean difference and a specified relationship between their 
variances. One population of each pair always had a variance 
near 100 while the other population’s variance was made some de- 
sired multiple of 100. If, for instance, the first distribution had a 
variance of 100, and the second a variance of 121, the populations 
were said to have a variance ratio of 1.21:1. The other variance 
ratios used were 1:1, 1.44:1, 2.25:1, 3.9:1, and 8.9:1.

Normal populations having the six population variance ratios 
were then generated having each of these five population mean 
differences: 0, 5, 10, 20, and 30. Fifty pairs of random samples of 
100 observations each were then drawn from each of the 30 popu- 
lation variance-ratio × mean-difference combinations, and the
mean sample Tilton's O calculated for each set of 50 values. The 
value of the population overlap, determined by calculating the 
number of scores in one population that could be matched by scores 
in the other population and dividing this total by the size of each 
population (N = 2000), was then subtracted from the mean sam-
ple Tilton’s overlap. These differences were then used to assess how
well the Tilton’s overlap values approximated the overlaps of the
populations from which the samples were drawn. Figure 1 shows
the findings of the investigations concerning the effects of popu- 
lation heteroscedasticity and varying population mean differences. 
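
The procedure for a single variance-ratio by mean-difference combination can be sketched as below. This is a hedged reconstruction, not the original program: it assumes a base mean of 50, a base standard deviation of 10, the equal-variance normal formula 2Φ(−d/2) for Tilton's O, and sampling without replacement within each sample. All function names are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, stdev

def tilton_overlap(x, y):
    """Tilton's O (per cent) for two samples, assuming O = 2 * Phi(-d / 2)."""
    avg_sd = (stdev(x) + stdev(y)) / 2.0
    index = abs(mean(x) - mean(y)) / avg_sd
    return 100.0 * (1.0 + math.erf(-index / (2.0 * math.sqrt(2.0))))

def count_overlap(pop1, pop2):
    """Percentage of scores in pop1 matched by identical scores in pop2."""
    c1, c2 = Counter(pop1), Counter(pop2)
    return 100.0 * sum(min(n, c2[s]) for s, n in c1.items()) / len(pop1)

def mean_error(mean_diff, variance_ratio, rng, n_pairs=50, n=100, size=2000):
    """Mean sample Tilton's O minus the counted population overlap."""
    pop1 = [round(rng.gauss(50.0, 10.0)) for _ in range(size)]
    sd2 = 10.0 * math.sqrt(variance_ratio)
    pop2 = [round(rng.gauss(50.0 + mean_diff, sd2)) for _ in range(size)]
    # Each sample is drawn afresh, so scores are "returned" between draws.
    estimates = [tilton_overlap(rng.sample(pop1, n), rng.sample(pop2, n))
                 for _ in range(n_pairs)]
    return mean(estimates) - count_overlap(pop1, pop2)

rng = random.Random(1)
print(mean_error(0, 8.9, rng))  # positive: the sample values overestimate
```

A positive result indicates that the sample values overestimate the population overlap; repeating the call over the six variance ratios and five mean differences would produce the kind of information summarized in Figure 1.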

Two major conclusions can be drawn from the information
shown in Figure 1: (a) at any particular population mean differ-
ence, the greater the violation of the assumption of equal variances,
the greater is the deviation of the typical sample Tilton's overlap
value from the population overlap figure; (b) the effect of any
particular violation of the assumption of equal population vari-
ance decreases as the population mean difference increases. It is
also somewhat reassuring to note that the errors shown by the de-
viations graphed in Figure 1 are all in the conservative direction;
that is, the Tilton's overlap values obtained from samples tend
to overestimate the overlap of the populations from which they are
drawn.

Figure 1. Value of the mean sample Tilton's overlap minus the population
overlap for pairs of normal populations. (a) Base variance = 104. (b) Both
sample n's = 100.


Effects of Non-normal Population Distribution 


The effect of distribution shape upon Tilton’s overlap as an 
estimator of population overlap was investigated by using non- 
normal distributions established by the investigators. Table 2 con- 
tains the descriptive statistics of the eight non-normal distribu- 
tions which were used. Three of the distributions are skewed right, 
three skewed left, one rectangular, and one bi-modal. Each of 
these distributions contained 2000 “scores.” 

With eight non-normal distributions and the various normal 
distributions possible, a large number of potential pairings were 
available. This number was increased further by the variety of
population mean differences which were possible. The decision was 
made to pair first each of the non-normal distributions with a nor- 
mal distribution of the same size (N) and the same variance. Using 
distributions of equal Ns and variances controlled these variables 
so that the effects of varying shapes could be seen alone. 

In every pairing of a non-normal with a normal population, the 
normally distributed population was established having a mean 
higher than that of the non-normal population. This fact is
particularly important to remember when examining the analyses
including a skewed population. A skewed-right population would
intersect the normal population with its skewed tail. A skewed-
left population would have the larger part of its distribution over-
lapping with the normal population.

TABLE 2

The Symmetry Statistic and the Mean and Variances of the Eight Non-normal
Distributions

Distribution Type       Mean     Variance   Symmetry Statistic

Skewed Right (SR1)      44.47    51.18        .48
Skewed Right (SR2)      41.61    51.34        UT
Skewed Right (SR3)      42.13    51.00        .70
Skewed Left (SL1)       48.53    51.18       -.48
Skewed Left (SL2)       48.43    51.59       -.76
Skewed Left (SL3)       49.97    51.00       -.10
Rectangular             62.00    52.05       0.00
Bi-Modal                47.99    53.92        .07

The sampling distributions of Tilton's overlap values were 
formed by taking 50 pairs of samples from the populations in- 
volved in each analysis. All samples had ns of 100. The mean 
Tilton's overlap values were used in describing the central tend- 
encies of the sampling distributions. Table 3 includes the popula- 
tion overlaps and the mean sample Tilton’s overlap for each of 
the population pairings investigated. 

TABLE 3

Values of the Population Overlap and the Mean Tilton Overlaps from Analyses
Involving Non-normal Populations*

Distributions            Population Mean   Population   Mean Sample
Paired                   Difference        Overlap      Tilton's

Skewed Right (SR1)             0             85.00        93.38
and Normal                     5             67.80        70.90
                              10             47.40        47.04

Skewed Right (SR2)             0             78.00        99.60
and Normal                     5             65.10        72.96
                              10             46.00        49.02

Skewed Right (SR3)             0             81.50        99.66
and Normal                     5             66.90        72.50
                              10             46.40        48.70

Skewed Left (SL1)              0             86.40        99.28
and Normal                     5             75.80        71.54
                              10             51.20        47.54

Skewed Left (SL2)              0             80.00        99.72
and Normal                     5             73.40        71.64
                              10             49.30        47.76

Skewed Left (SL3)              0             82.70        99.86
and Normal                     5             75.80        72.20
                              10             50.20        48.24

Rectangular and                0             80.50        99.00
Normal                         5             74.30        71.76
                              10             54.09        47.62

Bi-Modal and                   0             68.70        98.66
Normal                         5             63.50        71.58
                              10             51.00        48.28

* Both sample ns = 100.

The data in Table 3 show rather remarkable agreement be-
tween the population overlaps and the mean Tilton's overlap sam-
ple values. The exceptions to this statement are all found in the
cases where populations were established with a mean difference
of zero. The sample Tilton's values in these cases are too high be-
cause of the variation of the sample mean differences around the
population mean difference of zero. When Tilton's index is near
zero, the corresponding value of Tilton's overlap is near 100 per
cent. It is somewhat paradoxical that the Tilton's overlap estimate
obtained when the populations have a mean difference of zero
seems to make "more practical sense" than the count overlap on
the same populations. Results from Tilton's technique make more
practical sense because they yield a value near 100 per cent while
the count overlap takes on values of less than 100 per cent, by an
amount that depends upon the relationships between the
populations' Ns, variances, and "shapes."

The data in Table 3 also indicate that for the skewed-right dis- 
tributions the differences between the typical Tilton’s overlap 
sample values and the population overlaps increase as the popula-
tions become increasingly skewed. For the skewed-left populations, 
the same differences decrease as the skewness increases. The over- 
laps of the rectangular and normal populations were always un- 
derestimated by the sample overlaps. The direction of the error 
when bi-modal populations were used depended upon the differ- 
ence between the means of the two populations. 


Effects of Two Non-normal Population Distributions 


From the many combinations of non-normal population distri- 
butions that could have been formed, only six pairings were in- 
vestigated. The means, variances, and symmetry statistics of the
population distributions which were used are given in Table 2. 

The accuracy of the mean sample Tilton's overlaps in Table 4 
is somewhat surprising. The most inaccurate results were obtained 
from the pairing of two populations which had bi-modal distribu- 
tions. The mean Tilton's sample overlap is 15 per cent greater than 
the population overlap; here again the error is in the conservative 
direction. When both the magnitude and the direction of the errors 
are considered, the pairing of two rectangular populations causes 
the most important error in the overlap estimates given by Tilton's
method. In this case, the mean Tilton's sample overlap is 11.9 per cent
below the overlap of the two populations' distributions.

These writers by no means claim that the analyses shown in Table 4
fully illuminate the effects of non-normal population distributions. It
is hoped, however, that the variety of pairings used in these analyses
will at least let the reader form a general impression of the impact of
non-normality upon Tilton's overlap values obtained for samples from
non-normal populations having nearly equal variances.


TABLE 4

Values of the Population Overlap, the Mean Tilton Overlap, and the
Standard Error of Tilton's Overlap from Analyses Involving Pairs of
Non-Normal Populations

Population                    Population   Population   Mean Sample   Standard Error
Distributions                 Mean         Overlap      Tilton's      of Tilton's
Paired                        Difference   (%)          (%)           Overlap

SR, and Rectangular           19.87        17.2         16.12         2.946
SL, and Rectangular           12.13        42.9         68.64         3.840
SR, and SL,                    7.74        54.5         56.60         6.200
Bi-Modal and Bi-Modal         10.00        34.0         49.00         4.087
Bi-Modal and Rectangular      14.00        37.5         32.38         3.678
Rectangular and Rectangular   10.00        60.0         48.10         4.300


Effects of Non-normal, Heteroscedastic Population Distributions 


Six analyses were run in an effort to throw a small ray of light 
on the effects of having pairs of non-normal populations with un- 
equal variances. Table 5 includes three pairings of non-normal 
populations having equal variances and the same three pairings 
when the populations had unequal variances. 

The results in Table 5 are quite uniform: when the population 
variances are extremely different, the typical Tilton’s sample over- 
lap values greatly overestimate the population overlaps. When 
the population variances are made widely different, the differences 
between the mean Tilton’s sample values and the population over- 
laps all increase and change in direction from underestimation to 
overestimation. The effects of unequal variances when the populations
are non-normal are very similar to the results found with
normal populations having unequal variances shown in Figure 1.
TABLE 5

[illegible] of the Effects of Non-normality and Heteroscedasticity
(All Population Ns = 2000; Both Sample Ns = 100)

Population                    Population           Population   Population   Mean Sample   Standard Error
Distributions                 Variances            Mean         Overlap      Tilton's      of Tilton's
Paired                                             Difference   (%)          (%)           Overlap

SL, and Rectangular           51.00, 52.05         12.13        42.9         38.04         3.840
SL, and Rectangular           51.00, [illegible]   12.63        46.2         55.08         4.087
Bi-Modal and Rectangular      53.92, [illegible]   14.00        37.5         32.28         3.078
Bi-Modal and Rectangular      53.92, 208.46        14.51        41.1         50.18         4.017
Rectangular and Rectangular   52.05, 52.05         10.00        60.0         48.10         4.360
Rectangular and Rectangular   52.05, 208.46         9.50        50.0         65.30         4.534


As can be seen in both Figure 1 and Table 5, the errors (differences
between the mean sample Tilton's overlap and the populations'
overlap) are in the conservative direction. The practitioner
can feel reasonably sure that using Tilton's method with samples
from populations having unequal variances, whether the popula- 
tions are normal or non-normal, will not result in an underestimate 
of the overlap existing in the parent populations. 




Summary 

Pairs of populations were established having designated Ns, 
variances, means, and distribution “shapes.” Fifty pairs of ran- 
dom samples of some desired size were then taken from the pair of 
populations. Descriptive statistics were computed for the distribution
of 50 sample Tilton overlap values. The mean sample values were then
compared with the count overlap of the two populations. The count
overlap was obtained by calculating the number of scores in one
population that could be matched in the other population and converting
this count to a percentage of the two populations' Ns.
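
The two overlap measures compared in this study can be sketched in a few lines of Python. The sketch rests on assumptions that this section does not spell out: that Tilton's index divides the mean difference by the average of the two standard deviations and reads the result as the area common to two equal-variance normal curves, and that the count overlap matches identical score values one for one. The function names and the two small rectangular groups are hypothetical.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev


def tilton_overlap(scores_a, scores_b):
    """Tilton-style overlap (%) for two groups of scores.

    Assumption: the mean difference is divided by the average of the two
    standard deviations, and the result is read as the area common to two
    normal curves with equal SDs.
    """
    average_sd = (pstdev(scores_a) + pstdev(scores_b)) / 2
    d = abs(mean(scores_a) - mean(scores_b)) / average_sd
    return 100 * 2 * NormalDist().cdf(-d / 2)


def count_overlap(scores_a, scores_b):
    """Count overlap (%): scores in one group that can be matched in the other."""
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    matched = sum(min(freq_a[s], freq_b[s]) for s in freq_a)
    return 100 * 2 * matched / (len(scores_a) + len(scores_b))


# Hypothetical rectangular groups: integer scores 0..24 and 10..34.
rect_a = list(range(0, 25))
rect_b = list(range(10, 35))
print(count_overlap(rect_a, rect_b), round(tilton_overlap(rect_a, rect_b), 1))
```

For these two hypothetical rectangular groups (mean difference 10, variance 52) the count overlap is 60 per cent while the Tilton value falls near 49 per cent, the same direction of discrepancy reported for the rectangular pairing in Table 4.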
The results show that the typical sample Tilton overlaps accurately
reflect the population overlap under most violations of the assumptions
underlying Tilton's statistic. However, unequal population variances,
when the populations are both normal, cause the Tilton sample values to
greatly overestimate the population overlap. This error does decrease
as the populations' mean differences increase. When the populations
having unequal variances are both non-normal, the Tilton sample values
still overestimate the population overlap. Tilton's overlap, as was
mentioned earlier, may be considered to be more meaningful than a count
overlap when the populations have unequal variances. The separation of
populations having unequal variances seems to be better represented by
Tilton's technique than it is by a count overlap measure.

The effects of distribution shape are expectedly variable. When only
one of the populations in the pair is normally distributed, and has a
standard deviation nearly equal to that of the non-normal distribution,
the typical sample Tilton estimates of the population overlap are all
quite accurate. If both populations in the pair are non-normal in form,
the magnitude and the direction of the error depend upon the shapes of
the curves used in the analysis. Using a pair of bi-modal populations
results in the greatest discrepancy between the typical Tilton sample
value and the population overlap. The error in this case is again in
the conservative direction, with the sample Tilton overlaps
overestimating the population overlap.

The practitioner will probably wish to pay greatest heed to 
the effects of unequal population variances. With normal distributions,
sample Tilton's overlaps will substantially overestimate the
population overlap unless the two populations have unequal 
means. As the populations’ variances become more unequal this 
overestimation becomes greater at any particular population mean 
difference. The three analyses using heteroscedastic non-normal 
populations also show the sample Tilton overlaps to overestimate 
the population overlap. The direction of this error may cause the 
user to reject some measures, but it should not lead him to choose 
any bad ones. If an investigator's purpose is to select a measure 
that effectively separates two groups, then the Tilton estimate 
probably reflects quite accurately the conservatism that should
be a part of such decisions. 
THE RELATIVE EFFICIENCY OF REGRESSION
AND SIMPLE UNIT PREDICTOR WEIGHTS
IN APPLIED DIFFERENTIAL PSYCHOLOGY¹


FRANK L. SCHMIDT 
Michigan State University 


A very common problem in the behavioral and social sciences 
is the prediction of the standing of a person or thing on one vari- 
able, usually designated the criterion, from his or its standing on 
a number of other variables, usually called the predictors. Least- 
squared error multiple regression weights are most commonly used 
in weighting the predictors into a composite. These weights, which 
minimize, over the cases in the sample, the sum of the squared
deviations of the observed from the predicted criterion score, are
calculated from the normal equations which express the minimization
conditions (Anderson, 1958). The fitting of the regression weights
to the idiosyncrasies of the initial sample leads to a decrease in
effectiveness when these weights are applied to a new sample in
which these particular idiosyncrasies are not present. This
"shrinkage" is often substantial in practical situations (e.g., see: Kurtz,
1948; Cureton, 1950; and Kirkpatrick, 1951), especially when the 
initial sample is small. And small samples, as Lawshe and 
Schucker (1959) point out, are the rule rather than the exception 
in many areas of applied psychology. 

Certain other approaches to the weighting problem produce weights
independent of information in the first sample. Raw predictor scores
may be summed to yield a composite in which each test is weighted by
its standard deviation. Or, weights may be derived a priori from a
theory. Another approach is to standardize each predictor and then sum
predictor scores into a composite, thus weighting all predictors
equally. In the present study, the latter approach to weighting was
compared to multiple regression methodology in the data domain of
applied differential psychology. Because of the many socially and
economically important decisions made about individuals and groups of
individuals on the basis of psychological measurements of various
kinds, the weighting problem is perhaps of more practical significance
in this data domain than in many other areas of the behavioral sciences.

¹ This study is based on a dissertation submitted in partial fulfillment
of the requirements for the PhD degree at Purdue University. The author
is indebted to Professors Hubert E. Brogden, Joseph Tiffin, and Norville
M. Downie for their assistance and advice and to Dr. Vernon Urry for
writing the necessary computer program.

Of previous studies, the most significant is probably that of 
Lawshe and Schucker (1959). These researchers examined the 
relative efficiency of four test weighting methods: (1) the simple 
addition of raw scores (which weighted each test by its SD); (2)
weighting of raw scores by the test SD (which weighted each test 
by its variance); (3) weighting of raw scores by 1/SD of the test; 
and (4) least squares multiple regression weights. Three batteries 
of three tests each were used, each with a different average predictor
intercorrelation (.23, .37, and .61); average validity was
held constant across batteries. The criterion was a dichotomized 
GPA measure. Regression weights were derived on samples of 20, 
40, and 90, and all equations were applied to two hold-out samples 
of 75 each. None of the weighting methods showed any advantage 
over any of the other methods. Also, regression weights derived on
the larger samples did not show superiority to those derived on the 
sample of 20. 

Trattner’s (1963) findings with respect to different predictor 
weighting methods are confounded with the effects of different test 
selection methods used. It should be noted that the problem as defined
in the present study assumes no a posteriori selection of predictors.
The assumption is that either all predictors selected on an
a priori basis are used in the final equation or that selection among 
the various predictors is independent of their performance in the 
first sample. Using a first sample of approximately 100 and cross 
validation samples ranging from 60 to 125, Trattner found that 
none of the four combinations of test weighting and selection 
methods that he examined showed any advantage over the other
combinations. 

Wesman and Bennett (1959), using Ns of 262 to 449, found no 
advantage for least squares weights over a simple summing of raw 
scores in the prediction of grade point average from the three sub- 
tests of the College Qualification Test. Grant and Bray (1970), 
with Ns in the neighborhood of 200, found that the decrease in 
correlation in the cross-validation sample resulting from the use 
of unit rather than regression weights was only .01. 

Researchers concerned with the effects of item weighting within 
tests have almost invariably concluded that little or nothing is to 
be gained by such weighting under most conditions encountered 
in practice (e.g. see: Guilford, Lovell and Williams, 1942; Har- 
per and Dunlap, 1942; Phillips, 1943). 

Two other studies not directly concerned with predictor weight- 
ing have produced additional evidence for questioning the utility 
of differential weights (Ryans, 1954; Ewen, 1956). 

There is at least one rather basic criticism that can be leveled
at all of these studies: they have all compared the performance of 
different weighting methods in a new sample. What is really needed 
is information on the relative performance of the different methods 
when sampling is not a factor, i.e., information on their perform- 
ance in the population. When the applied psychologist is faced, for 
example, with the task of deciding between two different sets of 
weights to be applied to a given battery of tests to predict a cri- 
terion of, say, job success, the information he needs to make this 
decision is not how the two sets of weights compare in some new 
sample of 30, 80, or 100, but rather how their performances com- 
pare in the long run, i.e., in the population of potential applicants 
for the job. While it is true that a random sample can provide an 
unbiased estimate of the relative performances of different sets 
of weights in the population, sampling error may erase or even re- 
verse in the sample the actual inferiority-superiority relationship 
existing in the population. In the Lawshe and Schucker (1959) 
study, for example, cross-validation on samples of 75 revealed 
no difference in performance between composites produced by 
weighting each test by its SD and those produced when each test 
was weighted by 1/SD. It is safe to speculate that these two sets 
of weights, correlated -1.00, differ in their efficiency in the
population. Brogden (1946), among others, has shown that, in certain
situations, even small increments in the correlation coefficient can
be of significant practical import. 
Definitions of symbols used in this study are as follows:
$p$ = the number of predictors used.
$N$ = sample size.
$\Sigma_{xy}$ = the $(p + 1) \times (p + 1)$ population correlation matrix.
$\Sigma_{xx}$ = the $p \times p$ population correlation matrix of the $p$ predictors.
$\Sigma_{y}$ = the $p \times 1$ population vector of predictor validities.
$R$ = the $(p + 1) \times (p + 1)$ sample correlation matrix.
$R_{xx}$ = the $p \times p$ sample correlation matrix of the $p$ predictors.
$\rho(\beta)$ = the population multiple correlation produced by the actual,
infallible population weights, $\beta$.
$\rho(\hat{\beta})$ = a population correlation produced by a set of fallible
regression weights, $\hat{\beta}$, computed on a sample from the
population.


Figure 1 shows that, for any fixed $N$, $p$, and $\Sigma_{xy}$,
$\rho(\hat{\beta})$ forms a distribution in the population with some
variance [illegible]. $\rho(\beta)$ and [illegible] are point
distributions, each with zero variance. $\beta$ and $1$ are constant in
value and, whenever applied to the [illegible] $\rho(\beta)$ and
$\rho(1)$, respectively, with a probability of [illegible].
$\hat{\beta}$, in contrast, is produced by a somewhat different
[illegible] drawn. Thus if one draws a large number of samples, each of
size $N$, and computes a $\hat{\beta}$ for each sample, the resulting
[illegible] form a distribution more or less [illegible].

Point c on the abscissa represents $\varepsilon\rho(\hat{\beta})$, and
the distance [illegible] in the population of regression [illegible]
$p$, $\Sigma_{xy}$. It is this distance that the present study was
[illegible] of estimate.

Figure 1. Depiction of the distributions of $\rho(\beta)$, $\rho(1)$,
and $\rho(\hat{\beta})$ in the population for a given $N$, $p$, and
$\Sigma_{xy}$. Point c represents $\varepsilon\rho(\hat{\beta})$ and
point b represents a random value of $\rho(\hat{\beta})$. (Horizontal
axis: magnitude in correlation units.)

[illegible] selection problem led to the [illegible] of a Monte Carlo
methodology. Because the equation for the density function of
$\rho(\hat{\beta})$ is presently unknown, an analytic solution is not
now possible; and the use of empirical data presents numerous
difficulties (Schmidt, 1970). On the other hand, the Monte Carlo
approach allows the exact determination or very accurate estimation,
for any given $N$, $p$, and $\Sigma_{xy}$, of all of the values and
distances of interest shown in Figure 1. $\rho(\beta)$ and $\rho(1)$
can be computed by means of the familiar formula for the correlation of
a standardized variable with a composite:

$$\rho(W) = \frac{W'\Sigma_{y}}{(W'\Sigma_{xx}W)^{1/2}} \qquad (1)$$

where $W$ is any given set of weights, here either $\beta$ or $1$,
$\Sigma_{xx}$ is the population correlation matrix of the predictors,
and $\Sigma_{y}$ is the population vector of predictor validities. For
$\rho(\hat{\beta})$, for any given $N$ and $\Sigma_{xy}$, a large
number ($v$) of sets of $\hat{\beta}$ can be computed and applied to
the population, producing by equation (1) a distribution of
$\rho(\hat{\beta})$ comparable to the one in Figure 1. Then
$\sum_{i=1}^{v} \rho(\hat{\beta}_i)/v$ gives a very accurate estimate
of $\varepsilon\rho(\hat{\beta})$. Not only $v$, but also $N$, $p$, and
$\Sigma_{xy}$ can be arbitrarily specified by the researcher. In
addition, the fact that the Monte Carlo approach works with the
multivariate normal population means that the entire study is based on
the random or correlational model of prediction, which is more
appropriate than the fixed predictor model for most behavior science
data (Burket, 1964). The value $\varepsilon\rho(\hat{\beta}) - \rho(1)$
is the mean superiority (if any) of sample least squares weights over
unit weights for a given $N$, $p$, and $\Sigma_{xy}$. This is the value
with which this study is primarily concerned.
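
As a small numerical illustration of the formula for the correlation of a standardized variable with a composite, the following Python sketch evaluates it for unit weights and for population regression weights. The two-predictor population (intercorrelation .50, validities .50 and .30) and the function name `rho` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def rho(weights, sigma_xx, validities):
    """Correlation of the criterion with the weighted composite W'x."""
    p = len(weights)
    numerator = sum(weights[i] * validities[i] for i in range(p))
    variance = sum(weights[i] * sigma_xx[i][j] * weights[j]
                   for i in range(p) for j in range(p))
    return numerator / math.sqrt(variance)


# Hypothetical two-predictor population (values chosen only for illustration).
r12 = 0.5
sigma_xx = [[1.0, r12], [r12, 1.0]]
validities = [0.5, 0.3]

# Population regression weights for p = 2, from the normal equations.
beta = [(validities[0] - r12 * validities[1]) / (1 - r12 ** 2),
        (validities[1] - r12 * validities[0]) / (1 - r12 ** 2)]

rho_beta = rho(beta, sigma_xx, validities)
rho_unit = rho([1.0, 1.0], sigma_xx, validities)
print(round(rho_beta, 4), round(rho_unit, 4))
```

In this hypothetical population the regression weights yield a correlation of about .50 and unit weights about .46, so the most that error-free regression weights could gain over unit weights is roughly .04 correlation units.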

The simulation approach is not entirely without problems, however.
In order to insure that the $\Sigma_{xy}$ used are similar to those
actually existing in the data domain of interest, it seems appropriate
to employ sample correlation matrices ($R$) from the data domain as
estimates of the $\Sigma_{xy}$. Although the $R$ are, for all practical
purposes, unbiased estimates of the $\Sigma_{xy}$, sampling error in
the $R$ will cause the variance of the validities and predictor
intercorrelations to be larger, on the average, in the $R$ than in the
$\Sigma_{xy}$. This increased variance has the effect of increasing the
average efficiency of regression relative to unit weights, thus
exaggerating somewhat the utility of regression weights. Another effect
to be expected is an increase in the incidence of suppressor variables.

There is another potential problem. Significant departures of 
empirical data from the assumptions of linearity, normality and 
homogeneity of conditional variances—assumptions basic to the 
regression model—apparently occur in about 20 per cent of empirical 
data samples (Sevier, 1957; Tupes, 1964; Ghiselli, 1964; Guion, 
1965) and probably occur in a certain (but smaller) percentage of 
the empirical populations. Since such departures never occur in 
simulated populations, the results from Monte Carlo data may 
overestimate the efficiency of regression weights relative to unit 
weights in an empirical population. Because of these two
considerations, Monte Carlo estimates of
$\varepsilon\rho(\hat{\beta}) - \rho(1)$ should be interpreted as
maximal estimates of difference in performance between the two
weighting methods.


Method 
The Program 


Sample correlation matrices ($R$) computed from randomly drawn
samples from a multivariate normal distribution are distributed as
$W(N, p, \Sigma_{xy})$, the Wishart distribution. Using this
distribution, for each given $N$, $p$, and $\Sigma_{xy}$ combination,
100 $R$ matrices were generated and 100 $\hat{\beta}_i$ were computed,
which, in turn, produced 100 $\rho(\hat{\beta}_i)$ coefficients by
equation (1). These coefficients were plotted to yield a
$\rho(\hat{\beta})$ distribution as shown in Figure 1 and averaged to
give $\varepsilon\rho(\hat{\beta})$.
The process of $R$-generation works by means of the Bartlett
decomposition of the Wishart distribution (Bartlett, 1933; Kshirsagar,
1959; Wijsman, 1957), and is outlined in Appendix A. This approach
to sampling from $N(\mu, \Sigma)$ has been discussed and employed by
Browne (1968) and Herzberg (1969), both of whom have carried out tests
of the simulated data showing that the required assumptions are met. In
addition, Herzberg (1969) showed that the results from simulated data
were almost identical with the results from a large sample of empirical
data. Values of $N$ and $p$ lying within the ranges most frequently
encountered in practice were chosen. Ns investigated were 25, 50, 75,
100, 150, 200, 500, and 1000. Values of $p$ used were 2, 4, 6, 8, and
10. For each N-level within each $\Sigma_{xy}$, the difference
$\varepsilon\rho(\hat{\beta}) - \rho(1)$ was computed, with each
$\varepsilon\rho(\hat{\beta})$ being based on 100 values of
$\rho(\hat{\beta})$. The program also performed other operations not
relevant here.²
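
The overall procedure can be sketched in Python with NumPy. This is a minimal illustration and not the program used in the study: it draws raw observations directly from a multivariate normal population instead of generating $R$ through the Bartlett decomposition, it works with standardized weights throughout, and the function names and the three-variable population matrix are hypothetical.

```python
import numpy as np


def rho(weights, sigma):
    """Population correlation of the criterion with a weighted composite.

    sigma is the (p + 1) x (p + 1) population correlation matrix with the
    criterion in the last row and column.
    """
    sigma_xx, validities = sigma[:-1, :-1], sigma[:-1, -1]
    return float(weights @ validities / np.sqrt(weights @ sigma_xx @ weights))


def mean_superiority(sigma, n, replications=100, seed=0):
    """Estimate the mean of rho(beta_hat) minus rho(1) for samples of size n."""
    rng = np.random.default_rng(seed)
    p = sigma.shape[0] - 1
    values = []
    for _ in range(replications):
        # Assumption: sample raw data, then correlate (not the Wishart route).
        data = rng.multivariate_normal(np.zeros(p + 1), sigma, size=n)
        r = np.corrcoef(data, rowvar=False)
        beta_hat = np.linalg.solve(r[:-1, :-1], r[:-1, -1])  # sample weights
        values.append(rho(beta_hat, sigma))                  # applied to population
    return float(np.mean(values)) - rho(np.ones(p), sigma)


# Hypothetical population: two predictors (r = .50), validities .50 and .30.
sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.3],
                  [0.5, 0.3, 1.0]])
beta = np.linalg.solve(sigma[:-1, :-1], sigma[:-1, -1])
ceiling = rho(beta, sigma) - rho(np.ones(2), sigma)
small = mean_superiority(sigma, n=25, replications=50, seed=1)
large = mean_superiority(sigma, n=5000, replications=20, seed=1)
```

Because no set of weights can exceed the population multiple correlation, the estimate for any finite $N$ cannot exceed the population difference (`ceiling` here), and it approaches that value as $N$ grows.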


Sampling Plan for $\Sigma_{xy}$ Matrices


As mentioned above, sample correlation matrices are probably the best
available estimates of population correlation matrices. Accordingly, a
plan was set up to obtain a random sample of $R$ matrices from the data
domain of applied differential psychology. Four journals, Educational
and Psychological Measurement (EPM), Journal of Applied Psychology
(JAP), Journal of Educational Psychology (JEP), and Personnel
Psychology (PP), were selected as representing the general area, and
the years 1959-1969 were selected for examination. For two of the
journals (EPM and JEP) the odd years were sampled and for the other
two, the even years. In both cases, all correlation matrices of
dimensions 3 × 3 to 11 × 11 were recorded. In some cases, only parts of
larger matrices were used. Correlation vectors containing negative or
zero values were not used as validity vectors, and it was sometimes
necessary to rearrange rows and columns in order to meet this
condition. An attempt was made to keep all validities above .20, a
value chosen as approximately the minimum that would ordinarily be used
in practice. For each matrix size, a random sample of 10 matrices was
drawn from the pool of recorded matrices of that size and used as
estimates of the $\Sigma_{xy}$. For certain of the larger matrices, a
sample of 10 could not be obtained using only the volumes designated in
the sampling plan, and additional samples had to be taken from the
previously unused volumes. Even so, only eight 11 × 11 matrices could
be found and two had to be taken from another source (Wechsler, 1949,
p. 10). Obviously, the sampling fraction was much larger for the large
than for the small matrices.

² Requests for print-outs of the program should be sent to the author,
Department of Psychology, Michigan State University, East Lansing,
Michigan.


Results 


Table 1 presents the mean superiority across all matrices of
regression over unit weights
($\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)]$) for all 40
combinations of $N$ and $p$. Negative values indicate that unit weights
are superior to regression weights. When $N = \infty$,
$\varepsilon\rho(\hat{\beta}) = \rho(\beta)$ and
$\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)]$ becomes
$\varepsilon[\rho(\beta) - \rho(1)]$, the average difference in the
populations between the correlations produced by the (error-free)
population regression weights and unit weights (distance a-d in Figure
1 for a given $\Sigma_{xy}$). The last column of Table 1 presents the
average values of $\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)]$
across p-values at each N level.

An examination of the $\beta$ for each of the 50 $\Sigma_{xy}$ matrices
revealed that 31 had one or more suppressor variables. Since suppressor
variables are rarely used in applied differential psychology, it is
probably hazardous to generalize to this data domain from the present
matrix sample. Therefore, Table 2, showing values of
$\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)]$ for only those
$\Sigma_{xy}$ matrices without suppressors, was computed.

The bottom row in Table 2 indicates the number of matrices at
[illegible] p level without suppressors.


TABLE 1

Mean Superiority across All Matrices of Regression Weights over Unit
Weights [$\varepsilon(\varepsilon\rho(\hat{\beta}) - \rho(1))$] for All
Combinations of N and p*

                                            p                                  Mean
N                  2            4            6            8            10           Across p

25                 -.0133       -.0205       [illegible]  [illegible]  -.1269       -.0578
50                 .0120        [illegible]  [illegible]  [illegible]  [illegible]  .0028
75                 .0175        .0226        .0350        .0161        -.0095       .0163
100                [illegible]  [illegible]  .0486        [illegible]  -.0007       [illegible]
150                [illegible]  .0358        .0610        .0486        .0232        .0383
200                [illegible]  [illegible]  [illegible]  .0514        .0297        [illegible]
500                .0269        .0455        .0769        .0643        .0449        .051
1000               .0277        .0471        .0792        .0684        .0499        [illegible]
∞                  .0286        .0491        .0831        .0725        .0551        .0577
No. of Matrices    10           10           10           10           10           50

* For any p value, for each of the 10 matrices, at each N level,
$\varepsilon\rho(\hat{\beta})$ is the average across 100 computed
values of $\rho(\hat{\beta})$. The difference
$\varepsilon\rho(\hat{\beta}) - \rho(1)$ is a [illegible]

The ratio $\{\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)] /
\varepsilon[\rho(\beta) - \rho(1)]\}$ is the difference in efficiency
between unit and regression weights as a proportion of the maximum
possible difference and is graphed in Figure 2 against sample size for
all p-values for the entire sample of 50 $\Sigma_{xy}$ matrices.
Negative values indicate the superiority of unit weights. The value of
$N$ at which this ratio is zero is the sample size at which regression
and unit weights are equally effective for a given number of
predictors. Table 3 presents these "critical sample size" estimates
calculated for the entire sample of $\Sigma_{xy}$ matrices and also for
the matrices without suppressors. In this Table, the critical sample
size values for $p = 6$ in the entire $\Sigma_{xy}$ sample and $p = 4$
in the suppressorless sample have apparently been distorted by error in
the $\Sigma_{xy}$ samples. Rechecks of the computations showed no
errors. The "critical sample size" for $p = 6$ for the whole
$\Sigma_{xy}$ sample can probably safely be assumed to be somewhere
near the midpoint between 44 and 60. For the suppressorless
$\Sigma_{xy}$, the "break-even" sample size for $p = 4$ probably lies
near the midpoint between 40 and 105.
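
One simple way to locate such a break-even point from tabled values is linear interpolation between the two adjacent sample sizes at which the difference changes sign. The text does not say how the estimates in Table 3 were obtained, so the interpolation rule, the function name, and the illustrative values below are all assumptions.

```python
def critical_sample_size(sample_sizes, differences):
    """Sample size at which the difference first crosses from negative to
    non-negative, found by linear interpolation between tabled values."""
    points = list(zip(sample_sizes, differences))
    for (n0, d0), (n1, d1) in zip(points, points[1:]):
        if d0 < 0 <= d1:
            return n0 + (n1 - n0) * (-d0) / (d1 - d0)
    return None  # no sign change within the tabled range


# Hypothetical differences (regression minus unit weights) at three sample sizes.
estimate = critical_sample_size([25, 50, 75], [-0.02, 0.01, 0.02])
print(round(estimate, 1))
```

Because the ratio and the raw difference share the same sign, the zero crossing can be sought in either; interpolating on the raw difference avoids dividing by a small population difference.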


It should be noted in Tables 1 and 2 that, because of the
idiosyncrasies of individual matrix samples, the values of
$\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)]$ do not
consistently decrease with increases in the number of predictors. With
large samples of $\Sigma_{xy}$ matrices, this would probably not be the
case, but with samples of 10 matrices or less at each p value, the
additional error introduced into the $\hat{\beta}_i$ as a result of the
loss of one degree of freedom from the addition of one predictor is
apparently not as important in determining differences between columns
in Tables 1 and 2 as chance fluctuations, from matrix sample to matrix
sample, in the value $\varepsilon[\rho(\beta) - \rho(1)]$. In Table 1,
it can be seen that, for the entire sample of 50 $\Sigma_{xy}$
matrices, unit weights are superior to regression weights when the
sample size is 25, regardless of how few predictors are used. Even when
only two predictors are used, a researcher can expect simple unit
weights to be .0133 correlation points superior, on the average, to
regression weights. When 10 predictors are used, the expected advantage
for unit weights is .1269, a difference large enough to be of practical
significance in many applied settings. The average superiority of unit
weights over all p-values for sample size of 25 is .0578. Although the
inferiority-superiority relationship is generally reversed when
$N = 50$, the advantage of the regression weights is extremely small or
even negative. Over all p-values this advantage is .0028.


TABLE 2

Superiority of Regression Weights over Unit Weights
[$\varepsilon(\varepsilon\rho(\hat{\beta}) - \rho(1))$] for Those
Matrices Without Suppressors

                                            p                                  Mean
N                  2            4            6            8            10           Across p

25                 -.0233       -.0669       -.0652       -.1234       -.1342       -.0826
50                 .0041        -.0349       -.0190       -.0418       -.0503       -.0284
75                 .0084        -.0168       -.0047       -.0107       -.0224       -.0110
100                .0124        -.0109       .0121        -.0005       -.0119       -.0016
150                .0133        -.0033       .0209        .0065        -.0041       .0067
200                .0148        .0007        .0253        .0105        .0009        .0104
500                .0174        .0068        .0333        .0241        .0090        .0181
1000               .0180        .0089        .0361        .0280        .0188        .0206
∞                  [illegible]  [illegible]  [illegible]  [illegible]  .0149        .0230
No. of Matrices    8            4            4            2            1            19


Figure 2. $\{\varepsilon[\varepsilon\rho(\hat{\beta}) - \rho(1)] /
\varepsilon[\rho(\beta) - \rho(1)]\}$ as a function of N for all values
of p.


TABLE 3

Critical Sample Size Estimates for All Matrices and for Matrices
Without Suppressors

                        No. of      Matrices Without   No. of
p      All Matrices     Matrices    Suppressors        Matrices

2      [illegible]      10          40                 8
4      [illegible]      10          256                4
6      [illegible]      10          105                4
8      [illegible]      10          142                2
10     [illegible]      10          194                1
The elimination of $\Sigma_{xy}$ matrices with suppressors greatly
reduces the matrix sample sizes upon which the values in Table 2 are
based, but, because suppressor variables are rarely used in applied
differential psychology (e.g., see Adkins, 1946), these data are more
relevant to the purposes of this study than the data in Table 1. In the
suppressorless matrices, the relative advantage of regression weights
(both sample and population weights) is less than it is in the entire
sample of $\Sigma_{xy}$ matrices. At a sample size of 25, the mean
superiority of unit weights across p-values is .0826, quite a
respectable difference. When $N = 50$, unit weights are superior by
.0284 correlation units, on the average. When $N = 100$, unit weights
are still superior in three out of the five p-values and as an average
across all p-values.

For the entire sample of 50 $\Sigma_{xy}$ matrices, across all levels
of $p$, the sample size below which the applied psychologist can expect
to suffer a decrease in the size of his obtained correlation as a
result of using regression instead of unit weights is about 46. For
only the $\Sigma_{xy}$ matrices without suppressors, this value is 85.
If we arbitrarily assume that .0150 correlation units is the minimum
increase in predictive power, for most practical purposes, that will
render the computation of regression weights worthwhile, then for the
entire $\Sigma_{xy}$ sample, averaging across levels of $p$, the
minimum sample size needed is 60. For the matrices without suppressors,
this minimum average sample size is 184. The implication of these
latter figures is that, when applied psychologists are not actually
suffering a loss in predictive power as a result of using regression
weights, they are very often employing this complex statistical
technique when it is probably a waste of time and effort to use it.
The "critical" or "break-even" sample sizes presented in Table 3 by
number of predictors provide more detailed information about potential
losses in predictive power resulting from the use of regression
weights. Assuming that suppressors are not to be used, if an applied
psychologist has only two predictors, he needs, on the average, a
sample of only 40 to insure no loss of predictive power from the use of
regression techniques. If he uses six predictors, this figure jumps to
105. And if 10 predictors are used, he needs a sample of about 194. And
it should be remembered that, for reasons discussed earlier, these
"critical sample sizes" are best considered underestimates. Actual
sample sizes needed to insure equality of performance of regression and
unit weights are probably somewhat larger. It is not difficult to find
in the differential psychology literature studies employing regression
weights in which six or more predictors are used and the total sample
size is below 105.

This finding of potential loss is apparently new. Many of the previous
studies in this area raised the question of whether regression weights
were really more efficient than simpler weighting methods, but none
investigated, or even suggested, the possibility that the use of
regression weights could result in a reduction in the size of obtained
correlation.

Since many studies employing regression weights reported in the
literature are characterized by sample sizes below the critical values
presented in Table 3, it is concluded that many psychologists in
applied areas are routinely penalizing themselves by their adherence to
this statistical technique. The suggestion by Lawshe and Schucker
(1959) that many psychologists are blinded to the possibilities of
error in regression weights computed on small samples by the "apparent
mathematical precision" of regression techniques appears to be borne
out.


APPENDIX A 


The Mathematics of the Monte Carlo Method: The Bartlett
Decomposition of the Wishart Distribution.

(1) Let $\Sigma_{xy} = XX'$.
$\Sigma_{xy}$ can be factored into $XX'$ in a number of ways, all of
which will work in this problem. In the program developed for this
study, the factoring of $\Sigma_{xy}$ was approached via the roots and
vectors method.

(2) Let $A$ be a $(p + 1) \times (p + 1)$ matrix defined as

$$A = \frac{1}{N} TT',$$

where $T$ is a $(p + 1) \times (p + 1)$ lower triangular matrix, whose
lower triangular elements are independent random variables:

$T_{ij}$ $(i > j)$ are $N(0, 1)$
$T_{ii}$ are $\chi(N - i)$
$T_{ij}$ $(i < j) = 0$

$A$ is a sample variance-covariance matrix from a population where
$\Sigma_{xy} = I$, and $\chi(N - i) = \sqrt{\chi^2(N - i)}$.

(3) Let $C = NXAX' = XTT'X'$.
Then $S = \frac{1}{N} C$.
$S$ is a sample variance-covariance matrix from $N(\mu, \Sigma_{xy})$
and is distributed as $W(N, p, \Sigma_{xy})$.

(4) $C$ can be converted into $R$:

$$R = (\mathrm{Diag}[C])^{-1/2} \, C \, (\mathrm{Diag}[C])^{-1/2}$$

Then $R$ is distributed as the maximum likelihood estimate of
$\Sigma_{xy}$ based on samples of $N$ observations from a
$(p + 1)$-variate normal distribution with a population correlation
matrix, $\Sigma_{xy}$ (Browne, 1968).

(5) Generation of the T matrix. There are a number of techniques
for generating pseudorandom numbers from a rectangular
distribution and transforming these into random normal deviates
and random numbers from the Chi distribution required for the T
matrix (e.g., see: Hull and Dobell, 1962; Tausky and Todd, 1956;
Muller, 1959; Teichroew and Sitgraves, 1961). In this study, the
pseudo-random numbers were taken from a subroutine in the core
of the CDC 6500 computer, called RANF(X).


(a) These pseudo-random numbers, $w_i$, are transformed into
pseudorandom normal deviates, $g_i$, by the expressions (Box
and Muller, 1958):

$$g_i = (-2 \log_e w_i)^{1/2} \cos 2\pi w_{i+1}$$
$$g_{i+1} = (-2 \log_e w_i)^{1/2} \sin 2\pi w_{i+1}$$


(b) Pseudo-random numbers $\chi(f)$ from a Chi distribution with $f$
degrees of freedom are obtained, for even $f$, from:

$$\chi(f) = \left(-2 \sum_{i=1}^{f/2} \log_e w_i\right)^{1/2}$$

$$\chi(f + 1) = \left(-2 \sum_{i=1}^{f/2} \log_e w_i + g^2\right)^{1/2}$$

(Box and Muller, 1958; Kendall, II, 1946, pp. 132-133).

(c) Once generated, random normal numbers can be filled in at
random in the lower triangular cells of T. Random elements
from the Chi distribution are filled in on the diagonal of T,
but only in the order of the degrees of freedom of the
distributions from which they are drawn [See (2)].

(d) Note: (1) For any p, N, and $\Sigma_{xy}$, a different T matrix must
be generated for each R matrix sampled, and thus
for each p(J) computed.

(2) The program was set up to print out the last random
number from the RANF(X) function at the end
of each run. This number was then punched into
a card and used to re-enter the random number
sequence at the start of the next run. Thus each R
was generated from a different random number
sequence.
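
The steps of this appendix can be sketched in modern code. What follows is a minimal illustration, assuming NumPy; it is not the original CDC 6500 program. NumPy's generator stands in for RANF, and the function names (`box_muller`, `chi_variate`, `sample_correlation_matrix`) are hypothetical. The degrees of freedom on the diagonal follow step (2) as printed.

```python
import numpy as np


def box_muller(w1, w2):
    """Step (5a): two uniforms on (0, 1] -> two independent N(0, 1) deviates."""
    radius = np.sqrt(-2.0 * np.log(w1))
    return radius * np.cos(2.0 * np.pi * w2), radius * np.sin(2.0 * np.pi * w2)


def chi_variate(dof, rng):
    """Step (5b): one chi deviate with `dof` degrees of freedom."""
    w = 1.0 - rng.random(dof // 2)        # uniforms on (0, 1], so the log is finite
    total = -2.0 * np.log(w).sum()        # chi-square with 2 * (dof // 2) df
    if dof % 2 == 1:                      # odd dof: add one squared normal deviate
        g, _ = box_muller(1.0 - rng.random(), rng.random())
        total += g * g
    return np.sqrt(total)


def sample_correlation_matrix(sigma, n_obs, rng):
    """Draw one sample correlation matrix R for population matrix `sigma`."""
    k = sigma.shape[0]                    # k = p + 1 variables
    # Step (1): factor sigma = X X' by the roots-and-vectors (eigen) method.
    roots, vectors = np.linalg.eigh(sigma)
    x = vectors * np.sqrt(np.clip(roots, 0.0, None))
    # Step (2): lower-triangular T, N(0, 1) below the diagonal and
    # chi(N - i) on the diagonal for i = 1, ..., k.
    t = np.zeros((k, k))
    for i in range(k):
        for j in range(i):
            t[i, j], _ = box_muller(1.0 - rng.random(), rng.random())
        t[i, i] = chi_variate(n_obs - (i + 1), rng)
    # Step (3): C = X T T' X'.
    c = x @ t @ t.T @ x.T
    # Step (4): rescale C to unit diagonal.
    d = 1.0 / np.sqrt(np.diag(c))
    return c * np.outer(d, d)


sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
rng = np.random.default_rng(0)
print(sample_correlation_matrix(sigma, 50, rng))
```

Seeding the generator plays the role of the punched card in note (d)(2): a fresh seed, or a continued stream, gives each R its own random number sequence.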
TRUE SCORE THEORY: A PARADOX


J. O. RAMSAY 
McGill University 


CLASSICAL mental test theory, as set down by Gulliksen (1950),
was thought to depend fundamentally on four assumptions: (a) 
The mean of error is zero, (b) the correlation between true score and 
error is zero, (c) the correlation between errors on two oc- 
casions with the same test is zero, and (d) the correlation between 
true score on one occasion and error on another is zero. These as- 
sumptions are held to be true in the population of test scores and in- 
dividuals. More recently, however, Novick (1966) has shown that 
the last three assumptions follow from the first and the assumption 
of linear independence of error over testing occasions for a particu- 
lar individual. 

Although this work has clarified greatly the foundations of mental 
test theory, it is now clear that there are few assumptions more 
ubiquitous in any body of theory than the notion that expected ob- 
served score is equal to true score. Lord and Novick (1968) dis- 
cuss this relationship in some detail and reach the conclusion that 
if there is no a priori reason for accepting this statement (i.e., no 
platonic true score), then it is usually not unreasonable to simply 
define true score as the expected value of observed score. It is im- 
portant to ask whether there are consequences of this assumption 
that run against intuition or established tradition. This paper will 
attempt to show that there are such consequences. 

Test reliability is defined in one of two ways. If it is defined as
the correlation between test scores on one occasion with those on
another completely independent occasion, then a direct consequence
is that it is the ratio of true score variance to observed score
variance, or:

$$\rho = \frac{\sigma_t^2}{\sigma_x^2}. \qquad (1)$$

Alternatively, it may be defined directly as (1), which may also be
expressed as

$$\rho = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}, \qquad (2)$$

where $\sigma_e^2$ is error score variance.
Since all test scores are distributed on a closed interval in
practice, and usually in theory as well, it suffices to consider the
behavior of reliability for scores distributed on the interval [0, 1].
Test scores of this kind can have zero reliability only when true
score variance is zero. That is, error score variance must be finitely
large. To make this more clear, $\sigma_e^2$ can be expressed in the
following form:

$$\sigma_e^2 = \sigma_x^2 - \sigma_t^2 = E[\mathrm{Var}(x \mid t)] + \mathrm{Var}[E(x \mid t)] - \sigma_t^2 = E[\mathrm{Var}(x \mid t)] \qquad (3)$$

by using a fundamental theorem about variance and noting that
$E(x \mid t) = t$. Now the variance of observed score for a
particular true score is very definitely limited. Moreover, the
constraint that the mean of this distribution must equal the true score
bounds this variance by that of a Bernoulli variable with mean
equal to the true score. That is,

$$\mathrm{Var}(x \mid t) \leq t(1 - t). \qquad (4)$$

In order to see just how different from zero this lower limit on
reliability is apt to be in practice, it is necessary to propose some
more or less realistic models for the distribution of true score
itself. The logical candidate seems to be the beta distribution, which
can be made to look like the uniform or Bernoulli but which can also
reproduce the kind of unimodal distribution of scores that one
expects in practice. Therefore, let

$$f(x \mid t) = \frac{1}{B(a, b)}\, x^{a-1} (1 - x)^{b-1} \qquad (5)$$

$$g(t) = \frac{1}{B(c, d)}\, t^{c-1} (1 - t)^{d-1}. \qquad (6)$$

c and d in (6) are chosen to give g(t) a mean of $\mu_t$ and a variance
of $\sigma_t^2$ by using

$$c = \mu_t \left[\frac{\mu_t (1 - \mu_t)}{\sigma_t^2} - 1\right] \qquad (7)$$

$$d = (1 - \mu_t) \left[\frac{\mu_t (1 - \mu_t)}{\sigma_t^2} - 1\right]. \qquad (8)$$

The parameters a and b are chosen to provide $f(x \mid t)$ with a mean
t and the maximum possible variance by using (7) and (8) and
the fact that the maximum variance when $a \geq 1$ and $b \geq 1$ is

$$\sigma_{x \mid t}^2 = \begin{cases} \dfrac{t^2 (1 - t)}{1 + t}, & 0 < t < .5 \\[2ex] \dfrac{t (1 - t)^2}{2 - t}, & .5 < t < 1.0. \end{cases} \qquad (9)$$

The expression for the expected observed score variance is then

$$E[\mathrm{Var}(x \mid t)] = \int_0^1 \!\! \int_0^1 (x - t)^2 f(x \mid t)\, g(t)\, dx\, dt = \int_0^{.5} \frac{t^2 (1 - t)}{1 + t}\, g(t)\, dt + \int_{.5}^{1} \frac{t (1 - t)^2}{2 - t}\, g(t)\, dt. \qquad (10)$$


If this is evaluated numerically and substituted along with $\sigma_t^2$ into
(2), the result is a “realistic” lower bound on reliability as a
function of true score mean and variance. It is not, of course, the
absolute lower bound on reliability, but it is a lower bound consistent
with what intuition demands of the distribution of true score and
also with the assumption that expected observed score is
equal to true score.
Figure 1 shows this lower bound as a function of true score
variance for a mean true score of 0.6. Translated into classroom testing
language, it says that if the average student's true score is 60 per
cent and the standard deviation of true scores is 15 per cent, then
the test cannot be less reliable than 0.28. If the standard deviation
of true scores is 20 per cent, then the test has a reliability of at least

Figure 1. A lower bound on reliability as a function of true score standard
deviation. (Vertical axis: reliability; horizontal axis: true score standard
deviation, .04 to .24.)
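
The bound can be checked numerically. The sketch below is one possible calculation, assuming NumPy, a simple midpoint rule in place of whatever quadrature the author used, and the moment-matching and maximum-variance expressions (7) to (9). The function names are illustrative only.

```python
import numpy as np


def beta_params(mean, sd):
    """Beta parameters (c, d) giving the requested mean and standard deviation."""
    nu = mean * (1.0 - mean) / sd**2 - 1.0
    return mean * nu, (1.0 - mean) * nu


def max_conditional_variance(t):
    """Largest variance of a beta score distribution with mean t and a, b >= 1."""
    return np.where(t < 0.5,
                    t**2 * (1.0 - t) / (1.0 + t),
                    t * (1.0 - t)**2 / (2.0 - t))


def reliability_lower_bound(mean, sd, n_grid=20000):
    """'Realistic' lower bound on reliability for a beta true-score model."""
    c, d = beta_params(mean, sd)
    t = (np.arange(n_grid) + 0.5) / n_grid          # midpoints of (0, 1)
    weights = t**(c - 1.0) * (1.0 - t)**(d - 1.0)   # unnormalized beta density g(t)
    weights /= weights.sum()
    error_var = np.sum(weights * max_conditional_variance(t))
    return sd**2 / (sd**2 + error_var)              # true / (true + error) variance


print(round(reliability_lower_bound(0.6, 0.15), 2))
```

For a mean true score of 0.6 and a standard deviation of 0.15 this calculation should land close to the 0.28 quoted in the text.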


There seem to be essentially three ways out of this paradox. The 
first is to work only with scores transformed so as to be distributed 
on an infinite interval. However, this seems to make the concept of 
true score even more artificial than it was when defined to be ex- 
pected observed original score. Secondly, we may cast about for 
some set of assumptions to replace this one. This, of course, runs the 
danger that the resulting test score theory will be much stronger 
and have many more parameters than we might wish either to 
handle computationally or even admit theoretically. Finally, we 
may abandon the whole enterprise of describing test score behavior 
out of a purely predictive context and rely on standard statistical 
methodology to relate one test to another. Few are likely to favor an
approach as radical as this.
A TEST OF THE TRAIT-VIEW THEORY OF
DISTORTION IN MEASUREMENT OF PERSONALITY 
BY QUESTIONNAIRE 


SAMUEL E. KRUG¹ AND RAYMOND B. CATTELL
University of Illinois 


ATTEMPTS to correct questionnaire scores for distortion and
instrument factor ("content") effects have been largely based on
response set (Cronbach, 1946), multi-method (Campbell and Fiske,
1959) and social desirability (Edwards, 1957) conceptions. Recently
Cattell and Digman proposed subsuming most of these under a
comprehensive perturbation theory (1964) with corrections
indicated for questionnaires and ratings according to the specialized
form of perturbation theory called trait-view theory.

Trait-view theory proposes to consider the distortion as itself the
product of the personality factors one is out to measure, together
with effects of role adoption tendencies more specific to the testing
situation. The weights of the personality factors and dynamic
role-identification factors will vary with different motivational
situations in testing. By finding their weights, and storing them for a
sufficient variety of testing situations, it should be possible for the
practicing psychologist routinely to correct scores for the above
distortions.

¹ Now at the Institute for Personality and Ability Testing, Champaign,


Description of Trait-View Theory


In psychological terms the theory states that an individual's
misperception of himself on a particular trait is a function of that
trait and all his other traits, plus motivation specific to the
particular role. For example, a person high on super-ego strength
(G on the 16 P.F.) might, in the first place, understate his own
excellence on that trait, and an intelligent person might more 
clearly recognize the dominance motivations in his own behavior 
and thus overscore, relative to an individual who is less intelligent, 
on a dominance scale. Again, in responding to a question which
indicates his degree of unreleased tension, a person with a very highly
developed self-sentiment might be unable to recognize the full ex- 
tent of his tension, since it conflicts with his self-concept. He might, 
therefore, give an estimate of his behavior which is well below the 
true situation. 
Stated as a model, we have, in specification equation form: 


$$S_j = b_{j1}T_1 + b_{j2}T_2 + \cdots + b_{jc}T_c + \cdots + b_{jk}T_k$$


where $T_c$ represents the individual's score on the true factor c (as
distinct from his obtained score, $S_j$, on any trait scale) and the
several other $T$ terms represent the individual's true scores on other
traits which are involved with the performance on $S_j$.

It is clear that what must be done in order to obtain a true
measurement on the factor is to solve for the unknowns in this equation
and thus provide a means both of estimating the true factor and of
establishing which particular personality factors enter into the
distortion between $T_c$ and $S_j$.

It is apparent that if an individual is given a psychological test as
part of a job selection procedure, the greater his interest in obtain- 
ing the job (and the greater his knowledge of what is required), 
the more his answers will tend to come in line with his perception of 
the required ‘type.’ Similarly, a mother who is required to com- 
plete a questionnaire or answer some questions which are designed 
to help a school counselor advise her child more accurately, may 
tend to give completely accurate reports or somewhat jaded reports 
depending upon how ‘motherly’ she is feeling at the time. The final 
result in each case is a personality profile which, for one reason or
another, is distorted.

Role perception almost constantly has a powerful distortion
effect in personality questionnaires. In an analysis of the concept
and measurement of role behavior, as distinct from personality,
Cattell (1963) has proposed that role be recognized as an additional
factor beyond the usual personality factors. It differs from a general
personality factor in that it enters only a relatively specific subset
of situations, whereas the former shows across all roles.

[…]ng the role theory requires us at this point to adopt two
suppositions, neither of which has as yet actually been investigated.
The first says, as above, that we should add a measure of role
behavior in the statistical form of an ordinary primary personality
factor—actually a dynamic factor and a relatively specialized kind
that can be called a role factor, R. It is possessed by those who have
experience in the role to varying degrees and not by others. The
second supposition says that unlike a personality trait it is
‘modulated’ in level by the situation itself, that is, the situation can
[…]te it to new levels. We may add thirdly, as a practical test
proposition, that it is especially desirable to measure this role
factor by objective (decidedly less distortable) motivation
measurement devices (Cattell, Radcliffe, and Sweney, 1963).

In this first investigation we do not propose to separate the
contributions from (a) the role strength and (b) its modulation in the
[…] situation. It suffices for the aspect of trait view theory here
[…]ed if we measure the total effect of role involvement as
represented by T in the equation:


>, 


Sane = ЪТ + + БТ baTu + basis 


Design of the Experiment


The aim of the present research is to test the hypotheses: (1)
that the score of an individual on a scale set up to measure a
given source trait will be significantly determined not only by his
level on that source trait, but also by the influence of other
personality factors in a characteristic and meaningful distortion of the
self-evaluation in the test-taking behavior, and (2) that the weights
of personality factors influencing the score will differ according
to the definition of the test-taking situation, and that when this
situation includes instructions which give rise to the adoption of a
role, a distinct role factor will also emerge, influencing the total
score. As a third question, but in the applied field, it is asked
whether the true score can be better estimated in a given situation
by employing (a) a measure of the extent to which the role factor
operates for each individual, and (b) a measure of the role factor
together with weighted scores on the other personality factors.
It will be asked at the same time whether the estimation of the pure
source trait factor is different from that obtained simply on the
usual scale for the source trait itself.

The design for the experiment naturally required that the
measures for a sufficient number of source traits be employed both in
order that the factor equation be soluble and in order to deal with a 
really substantial part of the total personality effect upon any sin- 
gle scale. 

For this purpose, the Sixteen Personality Factor Questionnaire— 
16 P.F.—(Cattell, Eber, and Tatsuoka, 1969) was selected, since it 
samples sixteen dimensions of the personality sphere. Moreover, in 
order to insure greater reliability, two forms of the test, A and B, 
were employed. 

The subjects were 159 first- and second-year undergraduates from 
four colleges in the United States. 

The design of the study further required that the same trait scales 
be administered under different conditions or role-involvement sit- 
uations, four in all. The four test-taking role situations were se- 
lected in part for their theoretical interest and in part for their 
practical importance and their adaptability to an experimental 
setting. In each situation, the subjects were given both forms of 
the 16 P.F. and a short, objective motivation measurement scale 
which was constructed to assess their degree of role-involvement 
in the particular situation. 

The design of this role-involvement measure utilizes the findings 
of a long series of experiments on objective motivation measure- 
ment. Approximately half of the items had already been used in the 
Motivation Analysis Test (Cattell, Horn, Radcliffe, and Sweney, 
1964) and were of proven validity against the career and other 
factors therein. The rest were constructed to meet the particular 
needs of the present study. 

The following four situations were included in the present study: 

(1) The standard cooperative research situation. Since many
investigations are done under conditions where the subject is simply
sitting for an experiment, is guaranteed anonymity, and is assured 
absence of any consequences from his performance, this is an im- 
portant standard situation to be evaluated. 

(2) The job seeking situation. In this situation, the subjects
were instructed to fill out the questionnaire as part of a study to
evaluate their potential for a future job in teaching. More than
half of the subjects were education majors, so this was not merely
“acting” a role. They were told that the results would be used by
the college of education to determine the effectiveness of their
training. In this case, the individual is filling out the questionnaire
in such a way that another individual or institution is evaluating
him, and it is reasonable to assume that such a situation will
introduce an element of ‘desirability’ faking or ‘stereotype’ faking.
Although the particular type of job selected may dictate specific
aspects of distortion, nevertheless, one may assume that some
gen[…]tion to the job seeking situation in general.

(3) The ideal self-distortion. This situation was selected partly
because of its theoretical interest for general research on the
self-concept and partly to check on the implied assertion above that
the job seeking situation has many specific elements and is not
identical to a situation in which the individual gives the best
profile he can of himself. In other words, there should be some
difference between the individual's generalized ideal, and his estimate of
the ideal required for a particular situation.

(4) The ‘Operation Match’ or Prospective Dating situation. In
this situation, the subjects were asked to fill out the questionnaire
for a study being done by Operation Match, a computer dating
service. Again, this situation introduces desirability faking of a
type—that involved in appearing attractive to the opposite sex—
but it is assumed that this would be somewhat different from that
encountered in either (2) or (3) above.

It should be recognized incidentally, that even the standard
situation (1) is not conceived to be free from distortion in terms of the
theory which is being employed here.


Analysis of Data 

From each subject, 72 scores were obtained, 16 trait scores (Form 
А and Form B scores combined) and two role scores on each of four 
Occasions. Principal axes factors were obtained from the unreduced 
correlation matrix. The roots were plotted to determine the num- 
ber of factors to be retained for further analysis, as indicated by 
the Scree test (Cattell, 1966). Twenty-one factors were indicated
and retained. Convergent communalities were estimated for 21
factors by traditional iterative principal axes procedures. The
orthogonal, principal axes solution was rotated to the first
approximation of the oblique, simple structure position by a Procrustes
procedure (Hurley and Cattell, 1962).

Following this, graphical rotations were performed to approach
the simple structure position with somewhat greater precision.
Finally, the Maxplane program (Cattell and Muerle, 1960) was used
to “clean-up” the structure. The ultimate hyperplane count was
76.3 per cent, which is high among published researches and
indicates a sufficiently stable and, therefore, psychologically
meaningful position.²

Examination of the factor patterns for the 16 scales shows the 
presence of those differences in the contributions of various person- 
ality factors to particular scales that the theory would require. 
However, due to present disputes about the evaluation of the sig- 
nificance of the difference between two factor loadings it is not easy 
to give an acceptable appraisal of the significance. If, using the 
same principle as in Harris's (1965) test for the significance of a
loading we consider these as partial correlations and calculate
approximately, considering the average inter-factor r to be the
general interfactor r, a loading significant at the .05 level would be
approximately .16. By the same principle differences of .10, .08, and
.06, with the lower r at .05, .10, and .15 respectively would be
significant at the .05 level.

Psychologically, it is interesting to note that among the suggestive 
differences are: (a) a tendency for the outgoing warmth of factor 
A to favor higher estimation of factor O (worrying, guilt-prone) 
more strongly in the dating than in the job-seeking situation, (b) a
tendency for the dominance (E) factor to favor higher estimation
of factor H (venturesome) in the job-seeking than in the
anonymous situation, and (c) a tendency for high ergic tension level
(Q4) to favor higher estimation of factor O in the job-seeking than
in the self-sentiment situation. It is not difficult to generate psy- 
chological theories for these, from the understanding of the general 
nature of the personality source traits and these situations. 


² The complete factor pattern matrix has been deposited with the National
[…] Information Corp., 909 Third Avenue, […] N. Y. 10022. Order
Document No. […] for photocopies or $2.00 for microfiche.
In addition to this new analysis according to trait-view theory,
by factor analysis, we made the more traditional comparisons to
determine whether there were significant differences in mean scores
on each factor over the different role conditions. The results of
arate t-tests for each variable between raw scores in each condi- 
tion and the mean of all four conditions, which may be thought of 
as an averaged role condition, are given in Table 1. While these 
tests may not be considered conclusive, because of the possibility of 
correlated error terms, they may at least be considered indicative 
of real shifts among the set of role conditions. The similarity to the 
shifts under the ordinary conscious “fake good” instruction may 
be seen from column 5 in Table 1, taken from the 16 P.F. Hand- 
book (Cattell, et al., 1969). 

In order to test the third hypothesis, that there is practical util- 
ity to the theory in that the true factor may be better estimated 
by using scores on role variables as well as scores on other person- 
ality variables, estimates of the true factor scores were next made in 
various ways, separately for each of the four test-taking conditions. 
First, the factor was "estimated" in the ordinary way, simply 


TABLE 1
Differences in Personality Scores of Each Role Condition From the Mean Role Condition


Condition   Condition   Condition   Condition   Mean Role   Under instructions
    1           2           3           4       Condition   to “fake good”*

A 5.7% 6.9" 7.3% 6.7 6.7 1.1 
B 6.7 6.6 6.4 6.7 6.6 6.0 
c 5.4** 13 8.3** 1.1 1.0 1Л 
Е 5.9 5.7 5.3* 5.9 5.7 5.6 
F 6.2% 5.5% 6.1 6.2 6.1 6.5 
G 5.9** 7.3% 7.7% 5.8% 6.5 6.9 
A 5.5** 7.4 т.8** 1.2 7.0 1.9 
L 6.0 6.2* 5.8 5.6 5.9 4.9 
x 5.6** 4.0 3.5°* 4.1 4.3 4.2 
N 6.3** 5.8 5.2** 5.6 5.7 4.9 
0 5.3** 6.4 6.7** 6.1 6.1 6.8 

5.9** 3.6** 2.9** 4.0 4.1 3.4 
Q 6.0** 6.9 7.0* 6.7 6.6 6.6 
$ 5.6** 4.8 4.3%* 4.7 4.9 4.1 
Q 5.2** 7.7% 8.1% 6.6 6.9 1.7 
& 5.7** 2.9** gp es 3.5 3.6 2.5 


Note.—Scores are expressed here in terms of the sten scale, having a mean of 5.5 and a standard deviation of 2.
[…] .05. * From the Handbook for the 16 PF (Cattell, et al., 1969).


Т3 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


from the scale which was designed to measure it. The second
estimate was made by taking the regression of all the personality
variables (only for that situation) on the particular true factor source
trait.³ The third estimate was made by taking the regression of the
appropriate personality variable only on the two role variables for
that condition. Finally, the factor was estimated by taking the
regression of all personality and both role variables, i.e., combining
the second and third approaches described above.

In the first case this amounts simply to taking the correlation 
between the scale and the pure factor as calculated from the factor 
structure matrix. In the last three cases, the degree of validity is 
indicated by the multiple correlation coefficient between (a) per- 
sonality variables, (b) role variables, and (c) both of these, with 
the pure factor. These values are shown in Table 2. Summed values 
from Table 2 represent actually the means when the correlations 
were transformed to Fisher Z coefficients and back to r’s. On these 
Z values, analyses of variance were carried out separately for 
each role condition to test whether there was a significant difference 
in factor estimation among the four approaches. A single factor 
design for correlated between group observations was used (Winer, 
1962). After determining that the individual F ratios were sig- 
nificant, differences between condition means were examined sep- 
arately by the method of Scheffe (cf. Hays, 1963). The results of 
the analyses of variance and the post-hoc comparisons are presented 
in Table 3. 
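
The averaging step described above, in which correlations are transformed to Fisher Z, averaged, and transformed back to r, can be illustrated briefly. This is a minimal sketch using only the Python standard library; the function name `mean_correlation` is hypothetical and the input values are made up for illustration, not taken from Table 2.

```python
import math


def mean_correlation(correlations):
    """Average correlations on the Fisher Z scale and convert back to r."""
    z_values = [math.atanh(r) for r in correlations]   # r -> Z
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)                           # Z -> r


# Illustrative values only: the Z-scale mean of unequal correlations
# lies slightly above their arithmetic mean of 0.50.
print(mean_correlation([0.30, 0.70]))
```

Working on the Z scale keeps the averaging from being distorted by the skewed sampling distribution of large correlations, which is presumably why the analyses of variance were also carried out on the Z values.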


Discussion and Interpretation


The first two hypotheses stated earlier, (1) that the score of an
individual on a factor source trait will be determined not only by his
actual score on the scale for that trait but also by other personality
factors, and (2) that the weights on these will differ according to 


³ The reader should be reminded of the distinction, here and in Table 2,
between the aim of obtaining a correction for a role as here, and the aim of
what Eber and Cattell (1968) have called computer synthesis scoring or
variance allocation. In the latter it is assumed that regardless of test situation,
the scales contain displaced variance that should be returned to other scales.
To do that one averages across all possible role situations. In the present
method one uses specifically the weights for the known test taking situation.
[…] calculation, a constant peculiar to the equation needs to
[…] for the mean shift of everyone in going into a situation, as
TABLE 2
Validities of Source Trait Estimation in Four Test-Taking Role Situations


3 

76 79 ът 79 
бо 72 81 85 85 
755 8! 6 74 15 
Situation 3—Ideal Self Situation 4—Dating
Method 

A Mieetothymia e 65 6 6 55 в и 6 
оазе si в 8 85 s; 8 87 88 
Ego Strength 65 п 6 т пзп тп тп 
Dominance 47 62 48 и зве е 64 
роту з 50 4 50 и б и € 
& барит Ego 30 50 30 49 6 67 6 67 
үрөн 17 49 35 & т 4 4 47 
грама: 6 n e m з 86 & 8 
Мнса 58. 67. 56, BT па 7 SM 
гүш e 7 67 1 51 56 50 57 
00 B4 65 5 65 56 68 59 68 
A veda 48 62 50 6 02 50 10 50 
d nonien et oper ee 7з 78 7 78 
фы deny 68 74 69 78 d 68 6 67 
фо бецйшеп 66: 70 66, 7i 358 46 37 47 
56 65 57 66 и 10 6 т 


Methods: 1 = Ordinary scale scoring; 2 = Using all personality factors; 3 = Using dynamic role
factors; 4 = Combining 2 and 3.


Note.—AIl multiple correlations have been corrected for shrinkage. 
TABLE 3
Mean Differences in Factor Estimation Procedures: Results of Analyses of Variance
and Post-hoc Comparisons

Condition 1: F = 21.00**
                 Personality     Role     Personality + Role
  V_fs             +.07**       +.02**         +.07**
  Personality                   -.05**          .00
  Role                                         +.05**

Condition 2: F = 25.00**
                 Personality     Role     Personality + Role
  V_fs             +.08**       +.02*          +.10**
  Personality                   -.06**         +.02
  Role                                         +.08**

Condition 3: F = 50.00**
                 Personality     Role     Personality + Role
  V_fs             +.09**       +.01           +.10**
  Personality                   -.08**         +.01
  Role                                         +.09**

Condition 4: F = 26.67**
                 Personality     Role     Personality + Role
  V_fs             +.06**       +.01           +.07**
  Personality                   -.05**         +.01
  Role                                         +.06**

Note.—Values in the table represent the algebraic differences in mean validity between
the method designated by the column heading and the method designated by the row heading.
Thus, under condition one, .07 is the difference of the method using the regression of all personality
factors in estimating the pure factor and the ordinary single scale (V_fs) approach, the former being […]


the test taking situation and will include contributions specifically
from a role factor, can be considered together.

The source traits operating in the 16 P.F. can, on the whole, be
readily identified by the marker variable loadings. Factors 1
through 16 were hypothesized to be the source traits measured by
the 16 P.F. With the exception of factors 7 and 12 (source traits H
and O) all have their highest loadings on the appropriate marker
variables. Factor 7, while not having its highest loading on the
marker variables, does have reasonably high values. In addition,
other variables which are highly loaded by this factor, such as E
in the general experimental role, are known from previous researches
(Cattell, 1957) to show these moderately high loadings. Such
intercorrelations among primaries in fact provide the basis for the
second-stratum pattern of anxiety, such as C, L, and H.

The patterns expected for the role factors are somewhat less clear.
Factor 18 is fairly obviously a career role factor as indicated by its
marker variables. Factor 20 can be interpreted as the role factor 
for the Operation Match (Dating) situation from its loading of .81 
on the first role variable for that condition. Although there is a 
higher loading of .83 on variable E in the Dating role factor, we 
might reasonably hypothesize that in males, at least, dominant, 
assertive behavior is part of the mate-finding pattern. 

Factors 17, 19, and 21 still remain unidentified in the factor 
matrix. The means of the absolute values of the loadings of factor 
17 for each of the four situations are .276 for the general condition,
.086 for the career situation, .121 for the self-sentiment distortion,
and .131 for the Operation Match condition. The relative strength 
of its appearance in the variables measured in the general condi- 
tion suggests a tentative identification of factor 17 as a general 
experimental role factor. For factor 19, the same mean values are 
.111, .186, .209, and .183. Noting that the highest loading of 19 is in
the ideal self-situation and other high loadings are on G(+),
E(—), and C(+), we tentatively consider 19 a self-sentiment
role factor, i.e., a strength of interest in the ideal self.

A factor such as 21 was originally hypothesized to appear purely 
as an instrument factor which would have positive loadings on the 
role variables (because they are measured by objective motivation
devices) and negative or zero loadings on the personality variables
(measured by questionnaire). In the final rotation it would only
roughly fit the hypothesized pattern.

In testing the second hypothesis: that significant and significantly 
different weights will be found on other factors than the source 
trait being measured, as the test-taking situation changes, it is 
necessary to bear in mind that the search for simple structure will— 
as far as errors occur in rotation—operate in the direction of re- 
ducing the incidence of such values. However, it cannot do so be- 
yond a certain point, for the nature of simple structure is such that 
reducing some would tend to raise others. We should recognize also 
that in estimating a pure trait, we are concerned with the $V_{fe}$, not
the $V_{fp}$. In other words the weights of the various scale scores in
estimating a true source trait are what we need to compare from
situation to situation, from the practical test point of view, and 
these will respond to changing correlations of the traits from situa- 
tion to situation too. Whether this latter kind of change also occurs 
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is a matter for later and larger researches to settle. A rough idea of 
the amount of role involvement may be gleaned from the fact that 
the average number of significant loadings per variable is four 
(not one) in the role taking situations. 

That the contributions of the source traits to determination of 
scale score change from role to role was mentioned above in ex- 
amining successive rows of the factor pattern matrix. The contribu- 
tions of source traits E and G to scale A change direction from the 
first condition to the second condition. This change in direction of 
loadings was evident throughout the matrix. The change in value 
of the marker loadings is also quite interesting. The variance of 
scale H (venturesome) in condition two seemed to be taken up
primarily by factor E (assertiveness) and factor O (confidence), 
which makes a good deal of sense in the job seeking situation. 
Other examples are factor E, which seems to operate more forcefully
on scale E in the career setting than in the self-sentiment
distortion, or factor F (enthusiastic) which operates more strongly 
in the career situation than in the Operation Match situation. In an 
attempt to find how significant these changes are, we used, in ad- 
dition to the pair-wise comparison of loadings mentioned above, the 
Cochran Q statistic. This evaluates whether the true proportion of 
salient loadings was constant across all role conditions for each 
variable. Only one variable, Q2 (group dependence), showed a significant
shift (Q = 18; p < .001) in the number of salient loadings
across the various conditions, but this does not negate the
pair-wise findings. 
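
The Cochran Q computation referred to above can be illustrated with a short sketch. This is not the authors' program: the function name `cochran_q`, the use of NumPy, and the small 0/1 array are assumptions made here for illustration, and the data are hypothetical. Rows stand for the matched units, columns for the role conditions, and an entry of 1 marks a salient loading.

```python
import numpy as np

def cochran_q(x):
    """Cochran's Q for a (blocks x conditions) array of 0/1 outcomes.

    Returns (Q, degrees of freedom). Under the null hypothesis that the
    proportion of 1s is the same in every condition, Q is approximately
    chi-square distributed with k - 1 degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    k = x.shape[1]                    # number of conditions
    col_totals = x.sum(axis=0)        # successes per condition
    row_totals = x.sum(axis=1)        # successes per block
    grand_total = x.sum()
    denom = k * grand_total - np.sum(row_totals ** 2)
    if denom == 0:
        raise ValueError("Q is undefined when every row is all 0s or all 1s")
    q = (k - 1) * (k * np.sum(col_totals ** 2) - grand_total ** 2) / denom
    return q, k - 1

# Hypothetical salience indicators: 4 blocks by 4 role conditions.
example = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [1, 0, 0, 0],
           [1, 1, 1, 0]]
q_value, df = cochran_q(example)
```

The resulting Q would be referred to a chi-square table with the returned degrees of freedom.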

The final hypothesis to be examined is that a more valid esti- 
mate of any source trait factor can be made by taking the regres- 
sion of other personality and motivation scales (along with the ap- 
propriate personality scale) than can be done from the single scale 
itself. Table 3 above shows the outcome of this investigation. 
Four methods are actually compared here: (1) the simple scale, (2) 
using other personality scales, (3) using role factors, and (4) using 
(2) and (3) together. The results of the post-hoc comparisons 
among the means of the four methods suggest a general tendency 
toward improvement of the factor estimation by using other personality
variables. Some notable examples are factors E, H, and O
in situation one, factors E, F, M, and O in situation two, factors
A, E, G, N, and O in situation three, and factors A and O in situation
four. For the role variables the improvement is not quite so
striking, although factor O is improved in situation one and two,
factor Qs in situation two, and factor H in situation three.

The inclusion of both personality and motivation variables leads 
to a negligible and insignificant increase in estimation above that 
gained by personality alone. Two explanations are possible. The 
first is that the number of predictors is already so high that the 
information any new variable adds is redundant. A second explana- 
tion is that we should expect no great increase in predictability 
of a personality factor by the inclusion of motivation variables 
since the two domains have already been shown to be relatively in- 
dependent (Cattell, et al., 1964). In fact, this seems to be the case, 
since an inspection of the factor structure matrix showed low zero- 
order correlations between the personality factors and motivation 
variables. Although the two domains are apparently factorially in- 
dependent, it could, of course, be true, as Cattell, et al. (1964) have 
suggested, that the two combine additively in the prediction of 
most everyday life behavior. One would certainly expect from all 
common sense considerations that test-taking could be among the 
behaviors so affected by motivation. The most likely explanation 
is surely that we have not yet succeeded in designing the objective 
motivation batteries to center upon the motivation operative in the 
role situation.


DIFFERENCES BETWEEN THE MILLER
ANALOGIES TEST SCORES OF 
PEOPLE TESTED TWICE 


JEROME E. DOPPELT 
The Psychological Corporation 


The Miller Analogies Test (MAT) is a measure of scholastic
aptitude that is widely used by universities as one of the bases for
selecting graduate students and by business and government as
an aid in hiring high-level personnel. The test is administered in
licensed Centers which exercise strict control over the materials.
The names of all examinees and their scores are reported by the
Centers to The Psychological Corporation, the publisher of the
MAT. At the examinee's request, his score (or scores) on the MAT
will be sent to specified institutions or individuals. Each year a
number of people take the MAT for the second or third time.
This study is concerned with the scores of people who were tested


Although those who are retested with the MAT constitute a
small percentage of the total number tested, they now number
more than 1,000 persons per year. Individuals take the MAT
more than once for various reasons. There are those who feel that
their first score is not truly indicative of their abilities and hope
to improve their score on the second testing. Some who were tested


... some people may simply have forgotten they were tested
with the MAT in the past and unknowingly apply for a retest.
From the point of view of the official who must evaluate an
applicant’s MAT score, several questions regarding the results of 
retesting would seem relevant. How stable are the MAT scores? 
What kinds of differences are found when scores on first and sec- 
ond testings are compared? How does the time interval between 
testings relate to the difference between scores? Does the admin- 
istration of the same form twice yield results which differ from 
those obtained when different forms are given in the two testings? 
What is the relationship between initial scores and differences be- 
tween first and second scores? 

Answers to many of these questions have been sought by ear- 
lier investigators. The results of two studies published about ten 
years ago are of interest. Spielberger (1959) analyzed the data 
from three small samples (total N = 48) of psychology students 
and reported significant average gains between first and second 
testings, of approximately five points. Different forms of the 
MAT were used in the two testings. The time intervals were not 
specified for all cases, but they appear to have been relatively 
short. Spielberger also studied improvement according to score on 
the initial testing. He divided his cases into three initial-score 
groups and reported that the “effects of practice on the MAT 
scores were most facilitative for Ss with low initial scores and the 
amount of improvement was inversely related to initial MAT 
scores." The range of initial scores among Spielberger’s cases was 
45 to 88 with a mean above 70, indicating a relatively superior 
group. 

Coladarci (1960) studied the MAT scores of 56 male candidates
for administrative positions in education. These men had taken
the MAT twice with a median time interval of 15.8 months between
testings. The product-moment coefficient of correlation between
initial and retest scores was .82. A difference of 7.4 points
between the means of the two testings was found. This was attributed
to the experience intervening between the two administrations
rather than to any practice effect, “in view of the modal intervals
of time involved.” No significant correlation was found
between score gains and the time between testings. Coladarci
also studied the relationship between initial score and improvement
score after classifying his subjects into three groups according
to initial-score level. He reported that “in our sample there was
no relationship between the magnitude of the initial score and the
amount of improvement, either with obtained scores or estimated
scores.” Coladarci noted that this finding was in disagreement
with that reported by Spielberger and suggested that this may be
a result of the different types of samples in the two studies. The
range of initial MAT scores in the Coladarci study was 15 to 82
with a mean of 43.

The present study sought answers to the questions raised above
by analyzing the data of two large twice-tested samples selected 
from the publisher's registry of scores. The first sample, which includes
1,690 cases collected during 1959 and 1960, will be called
the 1960 Sample; the second sample, consisting of 624 cases, most
of whom were tested for the second time in 1968, will be identified
as the 1968 Sample. It cannot be assumed that the samples studied
in this paper are representative of all persons who are tested
with the MAT, but it is reasonable to suppose that they are representative
of persons who voluntarily or by request apply for a
retest.

The basic data consisted of the score from each administration 
of the MAT, including the form¹ of the test given, the difference
between the second and first scores, and time in months between 
testings. Each sample was divided into those who were tested 
with the same form of the MAT on both occasions and those who 
were tested with different forms. For these groups, correlations 
were computed among the scores, time intervals, and score differ- 
ences. The coefficients, along with the corresponding means and 
standard deviations, are shown in Table 1 for the 1960 and the 
1968 samples. 

In recent years an effort has been made to avoid administering 
as a retest the form of the MAT that had been given as the first
test. To this end, one form of the MAT was set aside to be used
only for retesting. The result of this procedure may be seen in the
smaller per cent of persons retested with the same form in the 1968
Sample (20%) than in the 1960 Sample (41%). 

¹ Forms G, H, J, and K were administered to the subjects of the 1960
Sample; Forms H, J, K, L, and R were administered to the 1968 Sample. To
make Form G scores comparable to scores on other forms, two points were
subtracted from the Form G scores in the range 32-72, as recommended in
the MAT Manual.


TABLE 1


Intercorrelations among MAT Scores from First and Second Testings, Difference between
Scores, and Time between Testings

[Table body not recoverable. Column headings: Second MAT Score; Diff. (Second Minus First); Time Interval (Months); Mean.]

Note.—For each sample, data above the diagonal are based on cases tested with different forms of the
MAT; data below the diagonal are based on cases tested with the same form of the MAT.


In the 1960 Sample, the “same form” and “different forms”
groups have very similar MAT means and standard deviations
for each testing. The second test score is higher, on the average, by
about six to seven points. In the 1968 Sample, the first testing 
mean score of the “same form” group is much lower than that for 
the “different forms” group. There seemed to be no apparent rea- 
son for this finding. It is seen as a sampling variation, associated 
with the relatively small size of the former group in comparison 
with the latter group. The average gains in score, however, are 
similar for the two groups, and about two points greater than the 
differences found in the 1960 Sample. 

The average time between testings in the 1960 Sample is longer
for the “same form” group (25.7 months) than for the “different
forms” group (15.6 months). Table 1 shows that the time between
testings in the 1968 Sample is considerably shorter than it was in 
the earlier sample. Although the time interval for the “same form" 
group is still found to be longer than it is for the “different forms" 
group, both figures are about half of what they were for the cor- 
responding groups in 1960. Thus, retesting with the MAT took 
place sooner, in the 1968 Sample, than it did about eight years 
earlier. 

The correlation between first and second testings may be re- 
garded as a coefficient of stability for the MAT. For the 1960 
Sample, coefficients of .89 and .87 are shown in Table 1 for the
“same form” and “different forms” groups, respectively. For the
1968 Sample the coefficients for the corresponding groups are .82
and .86. The coefficient of .82 was obtained for the smallest of the 
four groups, the 123 people of the “same form” group in the 1968 
Sample. The standard deviation of first testing scores for this 
group was about 13 per cent smaller than the standard deviation 
of the “same form” group in the 1960 Sample (13.9 as compared 
with 15.9). This type of restriction may account, in part, for the 
lower coefficient that was obtained in the later sample. In general, 
the stability coefficients are high and similar to the alternate form 
reliability coefficients reported in the MAT Manual. This finding 
is notable when it is recognized that the average time between 
testings ranges from 8.6 months to 25.7 months, over the four 
groups. 

The correlations between the scores on either the first or second 
testing and the time interval between testings do not indicate 
strong relationships. For the four groups in the two samples the 
coefficients range from —.01 to .17. Since people apply for a retest 
with the MAT for quite different reasons, this finding might have 
been anticipated. 

The correlations of the difference between scores (second score 
minus first score) and the time interval between testings are also 
reported in Table 1 for each group in the two samples. Here, as is 
true of the correlations of either first or second scores with time 
interval, the coefficients are low and are unimportant from a prac- 
tical viewpoint. Further study of the relationship between score 
difference and time is summarized in Table 2. 

TABLE 2


Means and Standard Deviations of MAT Score Differences* According
to Time Interval between Testings


                         Time Interval in Months
Form         0-2    3-6   7-12  13-24  25-36    37+   Total

1960 Sample
Same
  N            ?     75    116    139      ?    203     693
  Mean       8.6    5.9    6.7    6.1    6.3    6.7     6.6
  SD         8.2    7.3    8.0    7.2      ?    7.3       ?

Different
  N           30    126    169    168     99    134     997
  Mean       7.1    6.1    6.0    6.0    6.0    3.8     6.0
  SD         8.7    7.6    8.3    7.3    9.9    9.0     8.5

1968 Sample
Same
  N            ?     21     37     35     11      2     123
  Mean      10.9      ?    9.4    8.2    9.8      —     8.9
  SD           ?      ?      ?    8.7    8.1      —     9.0

Different
  N           19     78     83    145      6      4     501
  Mean       8.4    8.3    7.3    8.2      —      —     8.2
  SD         7.9    7.3    8.2    7.5      —      —     7.8

* Means and standard deviations were not computed when N was less than 10.
? = entry illegible in the source.


The time between testings was divided into six intervals (shown
in Table 2), and the mean and standard deviation of score differences
within each interval were computed. The average difference
between scores for the total “same form” group is only slightly
greater than that between the “different forms" group in each 
sample. There is some indication in both samples that those re- 
tested with the same form show higher gains when the time be- 
tween testings is less than three months. Although the differences 
between the “same form" and "different forms” groups are gen- 
erally small, it nevertheless seems desirable to give a different 
form of the MAT when retesting. People who are retested shortly 
after their first testing probably have an advantage if the identical
form is administered in the second session. Furthermore, the
use of a different form, regardless of the time between testings, is
more consistent with the idea of an independent evaluation than a
repeat testing with the same form. 

In Table 1 the highest coefficients, aside from the coefficients of 
stability, are found between score differences and second testing 
scores, followed by the correlations between score differences and 
first testing scores. There is not much value in detailed study of 
score changes in relation to the score on the second testing since
the chronology of the situation is not relevant to practical usage. 
It may be helpful, however, to consider the relationship between 
the score on the first testing and the change in score after retesting. 

As a practical guide to the interpreter of MAT retest data, a
single table which indicates expected changes in score on retest, 
according to the initial test score, was prepared. Table 3 provides 
such information based on the combined data from the 1960 and 
1968 samples. The table shows the 75th, 50th, and 25th percentiles 
of the distribution of score differences, for various score levels on 
the first testing. With the exception of those who scored 70 or 
higher on the first testing, at least 75 per cent of the group at each 


TABLE 3


Percentile Equivalents of the Distribution of Differences between First and Second MAT
Testings According to Score on First Testing
1960 and 1968 Samples Combined


                                        Score Difference    Per Cent of Differences
                                           Percentile             Which Were
Score on                 Range of
First Testing     N      Differences    75th  50th  25th    Positive  Negative  Zero

80 and above     36      −26 to 13        4     0     ?         ?         ?       ?
70-79           112      −10 to 19        9     4     ?         ?         ?       4
60-69           263      −28 to 22        9     4     1        75        21       4
50-59           419      −16 to 27       11     6     ?         ?         ?       5
40-49           508      −23 to 33       12     ?     2        79        16       5
30-39           470      −16 to 32       13     8     2         ?        14       4
20-29           415      −18 to 44       13     ?     ?         ?        15       4
Below 20         91       −8 to 34       15     9     ?         ?         ?       4
Total          2314      −28 to 44       12     7     1        79        17       4

? = entry illegible in the source.

Note.—... testing were obtained by 36 people ... −26 to 13. In this distribution
... percentile of the distribution of differences ... the bottom quarter of the
... per cent of the differences were negative ...


score level increased their scores on the second testing. (Examinees
who scored 70 or higher on the MAT have demonstrated very su- 
perior test performance. One would expect fewer increases in score 
among people who achieved such high scores on the first testing.) 
Nevertheless, it should be kept in mind that a substantial number 
of people obtained scores on the second testing which were lower 
than their first testing scores. Table 3 shows that the “per cent of 
differences which were negative" varies from 10 to 39 when the 
examinees are classified by score on first testing. 

Although the range of differences between second and first 
MAT scores is enormous (—28 to 44), the middle 50 per cent of 
differences is contained within a band of 11 points (12 to 1). The 
average difference between test and retest scores is a gain of about 
seven raw score points. Evidence of a regression effect may be
seen. Examinees whose initial MAT scores are between 50 and 80
gain about five points, on the average, when tested a second time. 
Examinees with initial test scores below 50 show an average gain 
of approximately eight points, on retesting. The people in the very 
high-scoring group (those with scores of 80 or higher) show an 
average gain close to zero, while those in the lowest category of 
first testing score (below 20) show an average gain of nine points. 
In general, a gain in excess of 12 points is likely to be found in 
about one fourth of the group who are tested twice. 

Discussion 

The writer must admit to a certain amount of fascination with 
test-retest data such as those provided in this report. It is usually 
difficult to obtain the scores of a large number of people who have 
been retested after a considerable period of time. Various types of 
classifications can be made for study of the scores and score differ- 
ences, and some of these data have been presented in the fore- 
going tables. But then one is faced with the nagging, practical 
question: Now that we have these results, to what end can the ad- 
missions officer or the employer use them? It is at this point that 
the writer's fascination with the data changes to concern, because
no simple, clear answer may be given. Some suggestions are of- 
fered here. 

If both scores of an individual are below the range imposed by
the institution's policy, or if both scores are above it, there is no
problem in deciding which score to accept. However, if one
score falls below a critical level and the second is above that level,
a careful review of the situation is indicated.

As a practical matter, it may be assumed that score differences
after a short time, say less than one year, are more associated with
measurement error than with important changes in the examinee.
Large differences over a short time interval may reflect an invalid 
testing on either the first or second occasion due to illness of the 
examinee, poor test administration or the like. When in doubt, a 
third testing might be requested. When the scores straddle a critical 
range, the wisest course would be to consider other information 
about the individual rather than routinely to accept either the first
or second score or to average them.

Over relatively long periods of time, intervening experience such
as education may have contributed to a large gain in score. (It is
possible, too, that a loss in score could be related to events in the
time between testings.) When a long time has elapsed between two
testings, the second score is more likely than the first to reflect
accurately the current status of the examinee.

It is obvious, but nevertheless important, that the score on a
test should not be the sole basis for an important decision, such as 
admission to graduate school or employment by an organization. 
The individual’s previous record of accomplishment and any other 
relevant information must be seriously considered. The results of 
retesting should be evaluated judiciously. Determining how the 
test scores complete the picture offered by other available evidence
is the most appropriate method for arriving at a satisfactory
decision.


Summary 

A study was made of the scores on the Miller Analogies Test of 
people who were tested on two different occasions. Two large sam- 
ples of cases were selected from the publisher's files. These cases 
may be considered representative of those who requested a retest 
with the MAT at the time the data were collected. In the first sam- 
ple, the retesting was completed no later than 1960, while in the 
second sample, the retesting of most cases was done in 1968.

The basic findings from the two samples are very similar. The 
stability coefficients for the MAT range from .82 to .89, with an
average gain of about seven points between testings. The correlations
of the time interval between testings with the first or second
scores, or with the difference between scores, are low and unimportant
from a practical viewpoint. Only small differences were
found between the average gains of those tested with the same 
form twice and the gains of those who were tested with different 
forms. As a practical matter, retesting with a different form is
recommended. The study of gains showed some evidence of a regression
effect, with the larger average gain obtained by those
with lower initial scores.

A table which shows selected percentile equivalents of the dis- 
tribution of score differences, according to level of initial test 
score, was prepared from the combined data of the two samples. 
This table should provide general information to those interested 
in the effects of retesting with the Miller Analogies Test. A dis- 
cussion of the problem of interpreting score differences has been 
included.


SETWISE REGRESSION ANALYSIS—A STEPWISE
PROCEDURE FOR SETS OF VARIABLES 


JOHN D. WILLIAMS AND ALFRED C. LINDEM
The University of North Dakota 


Setwise regression analysis is a new technique developed by the
authors to allow a stepwise solution when the interest is in sets of
variables rather than in single variables. Thus, the setwise regression
procedure bears a strong resemblance to the stepwise regression
procedure. There are, however, advantages to be gained by the use
of setwise regression analysis, and a disadvantage of the stepwise 
procedure is overcome. 

A disadvantage of the usual stepwise procedure is that it becomes 
inappropriate when there are more than two categories being binary 
coded. A simple example can be made with religious affiliation. 
Four categories might be used: Catholic, Protestant, Jewish, and 
Other. Three binary predictors can be made with the first three 
religious affiliations, and the fourth category can be represented 
as not having membership in the first three categories. If religious 
affiliation were used in conjunction with other information, the 
stepwise procedure would not yield a valid indication of the im- 
portance of the religious variables. The setwise procedure, on the 
other hand, would allow a direct approach to such a situation. 

The setwise procedure drops one set of variables at a time in a
stepwise fashion. There will be as many steps as there are sets. The
steps are accomplished by an iterative procedure that allows the R²
(multiple correlation coefficient squared) term to be maximized at
each step in a backward stepwise procedure. Once a set is discarded,
the set is no longer considered at later steps. One set is discarded at
each step, until there is only one set remaining.
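
The backward elimination of sets described above can be sketched as follows. This is an illustration in Python, not the FORTRAN program itself: the function names, the dictionary used to define the sets, and the use of NumPy least squares are all assumptions introduced here. The sketch reads "R² maximized at each step" as discarding, at each step, the set whose removal leaves the largest R² among the sets still in the equation.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of the least-squares regression of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - (residual @ residual) / total

def setwise_backward(X, y, sets):
    """Backward elimination of whole sets of predictors.

    sets maps a set name to the list of column indices of X in that set.
    Returns a list of (discarded set or None, R-squared, sets remaining);
    the first entry describes the full equation.
    """
    remaining = dict(sets)
    all_cols = [c for cols in remaining.values() for c in cols]
    history = [(None, r_squared(X[:, all_cols], y), list(remaining))]
    while len(remaining) > 1:
        best_name, best_r2 = None, -np.inf
        for name in remaining:
            # R-squared with this candidate set left out of the equation.
            cols = [c for other, idx in remaining.items()
                    if other != name for c in idx]
            r2 = r_squared(X[:, cols], y)
            if r2 > best_r2:
                best_name, best_r2 = name, r2
        del remaining[best_name]
        history.append((best_name, best_r2, list(remaining)))
    return history
```

With a binary-coded categorical variable such as religious affiliation, the three binary predictors would be entered as one set, so that they are retained or discarded together.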
Input

Data cards contain for each observation the criterion and predictor 
variables in any format or order. Parameter cards specify problem 
identification, number of observations, total number of variables,
number of sets, criterion variable, optional printout of data, and
optional printout of residuals. Set selection cards specify the number 
of predictors in a set, and the variables included in a set. 


Limitations


The maximum dimensions are as follows: 
99,999 observations, 40 variables including the criterion variable, 
and 10 sets of predictor variables.


Computer and Program Language 


The program is written in FORTRAN IV level F for the IBM 360
(64K). 


Output 


The sets of variables remaining in the prediction equation are 
given for each stage. Also given for each stage is an analysis of 
variance for the regression, the beta and regression weights, the 
means and standard deviation for each variable, R², R, 1 − R²,
and the loss in the R² term for each stage.

A printout of the program and sample output will be supplied on 
request.


A NOTE ON GENERATING MULTIVARIATE DATA WITH
DESIRED MEANS, VARIANCES, AND COVARIANCES 


J. R. CAPRA AND R. S. ELSTER


Naval Postgraduate School 
Monterey, California 


The problem to be discussed involves creating a set of n
observations on p variables, with the p variables having specified
means, variances, and covariances. The method which will be presented
differs from those previously given by Kaiser and Dickman
(1962) and Wherry, Naylor, Wherry, and Fallis (1965), in that the
method does not use the models of principal component or factor
analysis.


Derivation and Procedure


Let A be a p by n matrix of n independent observations on p
variables. If the p variables are independent, with means of zero
and unit variances, then, assuming the normal model, A is distributed
as a sample from a multivariate normal population with a mean
vector of 0 and a variance-covariance matrix equal to the identity
matrix. More succinctly, A is distributed as N(0, I).

Anderson (1958, p. 21) showed that if one transforms A in the
following way:

Z = CA,                                                        (1)

then Z is distributed as N(0, CIC′) or N(0, CC′), where C is a
p by p matrix used to transform A. Given a specified correlation
matrix R, the problem is to decompose R such that CC′ = R, where
C will be a lower triangular matrix since R is symmetric.

If one can derive C and if one has specified the correlation matrix
R, then a transformation exists which, when applied to A, will
give a set of observations on p variables with means of zero, unit
variances, and the specified correlations among them. It is
a simple task to apply linear transformations in order to achieve
the desired means and variances.

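The numerical technique for deriving C from R uses Crout
factorization (Kunz, 1957, pp. 226-229). The following recursion,
which is easily programmed, allows C to be derived (Odell and
Feiveson, 1966):

\[
\begin{aligned}
C_{i1} &= R_{i1} / \sqrt{R_{11}}, & 1 \le i \le p, \\
C_{ii} &= \sqrt{R_{ii} - \sum_{k=1}^{i-1} C_{ik}^{2}}, & 1 < i \le p, \\
C_{ij} &= \Bigl[ R_{ij} - \sum_{k=1}^{j-1} C_{ik} C_{jk} \Bigr] / C_{jj}, & 1 < j < i \le p, \\
C_{ij} &= 0, & i < j \le p.
\end{aligned}
\]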
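
The recursion can be sketched in a few lines of Python. This is an illustration only, not the authors' program: the function names `crout_square_root` and `generate_sample`, the use of NumPy, and the seeded random number generator are assumptions introduced here. The first function builds the lower triangular C from R by the recursion; the second forms Z = CA and then applies the linear rescaling to the desired means and standard deviations.

```python
import math
import numpy as np

def crout_square_root(R):
    """Lower triangular C with C C' = R, computed row by row."""
    p = len(R)
    C = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            # Sum over the entries of rows i and j that are already known.
            s = sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                d = R[i][i] - s
                if d <= 0.0:
                    raise ValueError("R must be positive definite")
                C[i][i] = math.sqrt(d)
            else:
                C[i][j] = (R[i][j] - s) / C[j][j]
    return np.array(C)

def generate_sample(R, means, sds, n, seed=0):
    """Return a p x n array drawn with correlations R, then rescaled."""
    C = crout_square_root(R)
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((len(R), n))   # independent N(0, 1) entries
    Z = C @ A                              # equation (1)
    means = np.asarray(means, dtype=float)[:, None]
    sds = np.asarray(sds, dtype=float)[:, None]
    return sds * Z + means
```

Because A is itself a random sample, the sample correlations of the result approach R only as n grows.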

Ordinarily, a researcher will establish the n observations on each
of the p variables of A by using a random number generator. The
assumption made is that the random number generator will yield
variables having zero means, unit variances, and zero intercorrelations.
Because random number generators yield these characteristics
only in the limit, a researcher may wish initially to adjust A such
that in fact the p variables do have zero means, unit variances,
and zero intercorrelations. (Of course, this operation will somewhat
distort the randomness of the sample.) Nevertheless, to allow this
adjustment, if the user judges it to be desirable, the following
option is available.

Let X be a p × n matrix obtained from the random number
generator, the sample mean vector being given by x̄ and the sample
variance-covariance matrix being given by M. The goal is to
transform X into a set of data with sample means of zero and a
sample variance-covariance matrix of I. By definition,


XX' = M. 
Consequently, what is needed is a matrix, D, such that 
DXX′D′ = DMD′ = I.


This matrix could then be used to transform the matrix X into a
matrix Y having a sample variance-covariance matrix I:
Y = DX.
Now, since
DMD′ = I,
then
M = D⁻¹(D⁻¹)′,


which means we can obtain D⁻¹ by Crout factorization and D
itself by simple matrix inversion. 
The sample means of this new set of observations are given by 


ȳ = Dx̄,
where ȳ is a p component vector. If Y_i refers to the ith row of Y
and ȳ_i refers to the ith element of the ȳ vector, a matrix with zero
means can be obtained by subtracting ȳ_i from each element in Y_i,
for each i, from 1 to p.
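
The optional adjustment can be sketched as follows, again as an illustration rather than the authors' code. The function name `standardize_sample` is hypothetical, and the sketch assumes that M is computed with NumPy's `np.cov` (variables in rows, divisor n − 1) and that there are at least two variables. The lower triangular factor of M plays the role of D⁻¹, so Y = DX is obtained by solving a triangular system rather than by forming the inverse explicitly.

```python
import numpy as np

def standardize_sample(X):
    """Transform a p x n sample so its covariance matrix is I and its means are 0."""
    X = np.asarray(X, dtype=float)
    M = np.cov(X)                    # p x p sample variance-covariance matrix
    L = np.linalg.cholesky(M)        # M = L L', so L corresponds to D^{-1}
    Y = np.linalg.solve(L, X)        # Y = D X without inverting L explicitly
    Y = Y - Y.mean(axis=1, keepdims=True)   # subtract each variable's mean
    return Y

rng = np.random.default_rng(1)
raw = rng.standard_normal((3, 40))   # hypothetical generator output
adjusted = standardize_sample(raw)
```

Subtracting the means does not change the covariance matrix, so the adjusted sample has exactly zero means, unit variances, and zero intercorrelations, up to rounding error.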


Required Input Data 


All that is required of the user are the correlation matrix, the
desired mean and variance for each variable, and the sample size 
(number of observations on each variable) which he desires to 
have generated. Of course, the user should insure that the correlation 
matrix is nonsingular and positive semidefinite. 


Output from the Program 
The program will generate a multivariate sample from a popu- 
lation with the specified means, variances, and covariances. Sample 
means, variances, and correlation coefficients are also computed. 


Summary 


A method is shown for creating a set of n observations on p
variables, with the p variables having specified means, variances, 
and covariances. This method differs from previous techniques in that 
it uses Crout factorization to develop the desired variance-covariance 
matrix instead of using the methods of component or factor analysis. 
Because the procedure assumes that it begins with р variables 
having zero means, unit variances, and zero intercorrelations, а 
Procedure is also given for transforming the original data so that 
they fulfill these conditions. 
A TEN FACTOR UNEQUAL “N”
ANALYSIS OF VARIANCE PROGRAM 


NORMAN K. RUBIN AND ALAN L. GROSS
The City University of New York 


THE program will perform an analysis of variance for a wide
class of experimental designs: complete factorial, nested, ran- 
domized blocks, split plots, and other similar designs. 

Unlike many of the commonly available Analysis of Variance 
programs, this procedure possesses the following characteristics: 

1. The only restriction on cell size is that no cell is vacant. Cells 
may contain differing numbers of observations. There is also no
upper bound to the number of observations. 

2. The program in its present form will process designs having up
to k = 10 factors.

3. The only restriction on the number of levels allowed for the ith
factor ($L_i$) is that the product $\prod (L_i + 1)$ not exceed the
dimension of the array DATA. For example, to process a 5 × 9 ×
3 × 5 × 2 design the dimension of DATA must be at least 6 × 10 ×
4 × 6 × 3 = 4320. On an IBM 1130 with 16K of memory, the
dimension of DATA is normally 5000.

4. The program, which is written in ASA BASIC FORTRAN,
uses no external files. Thus there should be no problem in adapting
the program to a computer having a FORTRAN compiler. A section
of comments in the program listing describes in detail the changes
to be made.

5. As an option, the program allows the user to insert a FORTRAN
subroutine to preprocess or transform the input data.
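The storage rule in item 3 can be checked before a run. The following is a minimal sketch, assuming the restriction is that the product of (levels + 1) over all factors must fit within the DATA array; the function names and the default of 5000 (the IBM 1130 figure quoted above) are illustrative.

```python
from math import prod

def required_data_dimension(levels):
    """Product of (L_i + 1) over all factors, as in item 3."""
    return prod(level + 1 for level in levels)

def design_fits(levels, data_dimension=5000):
    """True if the design can be stored in a DATA array of the given size."""
    return required_data_dimension(levels) <= data_dimension
```

For the 5 × 9 × 3 × 5 × 2 design of the example, `required_data_dimension([5, 9, 3, 5, 2])` gives 4320, which fits within 5000.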

The unequal “N” design is analyzed using an approximate 
method described by Scheffé (1959). Basically this method leads one 
to do an analysis of variance of the cell means with an adjustment
to the final sum of squares. When the cell sizes are equal, the pro-
gram produces the exact solution.

Program input consists of a single control card, variable format 
card(s), and the data. The data for each cell are preceded by a
header card giving a cell ID number and the number of cell 
observations. The cells themselves can be entered in any order. 

Output consists of the normal ANOVA table properly labeled, cell 
means, cell standard deviations, and the cell sizes. 

A source copy or listing of the program may be obtained from 
either author. 


REFERENCE 
Scheffé, H. The analysis of variance. New York: Wiley, 1959.
A COMPUTER PROGRAM FOR NONPARAMETRIC
POST HOC MULTIPLE COMPARISONS 


JAMES J. ROBERGE¹
Temple University 


MOST researchers in the behavioral sciences follow the decision
to reject a null hypothesis, on the basis of an F-test, with a post hoc
analysis of specific linear contrasts, using one of the various multiple
comparison procedures (e.g., Scheffé, 1953, 1959; Tukey, 1953).
Recently, Nemenyi (1963), Dunn (1964), and Rosenthal and Fer- 
guson (1965) discussed similar procedures which may be employed 
following the rejection of the null hypothesis by a given nonpara- 
metric test, i.e., the Kruskal-Wallis (1952) one-way analysis of vari- 
ance for rank data or the Friedman (1937) two-way analysis of 
variance for rank data. The program described in this paper is 
designed to perform these nonparametric post hoc multiple com- 
parisons. 


Rationale 
Nemenyi (1963) proposed a procedure for determining which pair- 
wise comparisons among k treatment populations are significant. 
This procedure requires the calculation of the statistic d, and the 
value of the constant C, for each pair of samples. 
The value of d is calculated by the following formula: 


$d = |\bar{R}_j - \bar{R}_{j'}|$

where $\bar{R}$ is the mean rank for a given sample (or experimental condi-
tion) and j and j′ represent indices of disjoint subsets of the k samples
(or experimental conditions).


¹The author gratefully acknowledges the support for this research which
was provided by a Faculty Research grant funded by Temple University.
The constant, C, for a given comparison is calculated by one of
the following formulas: 

If the Kruskal-Wallis one-way analysis of variance is the
appropriate statistical technique, then


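\[ C = \sqrt{\chi^2_{\alpha,\,k-1}\left[\frac{N(N+1)}{12} - \frac{\sum_{s=1}^{r}(t_s^3 - t_s)}{12(N-1)}\right]\left(\frac{1}{n_j} + \frac{1}{n_{j'}}\right)} \]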


where k is the number of independent samples, χ² has k − 1 degrees
of freedom, α = .05, N is the total number of observations (or
ranks), r is the number of sets of tied observations, t is the number
of tied observations for a given set s, n is the number of subjects
in a given sample, and j and j′ are as defined above; on the other
hand, if the Friedman two-way analysis of variance is the appropri-
ate statistical technique, then

\[ C = \sqrt{\chi^2_{\alpha,\,k-1}\left[\frac{k(k+1)}{6n} - \frac{\sum_{s=1}^{r}(t_s^3 - t_s)}{6n^2(k-1)}\right]} \]


where k is the number of experimental conditions, n is the number
of subjects (or matched subjects), and χ², α, r, s, and t are as defined
above.

According to Nemenyi's test, the hypothesis that two samples j
and j′ were drawn from identically distributed populations is
rejected if the value of d for the samples exceeds the value of C.
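The decision rule can be illustrated for the Kruskal-Wallis case with a short sketch. It assumes that tied observations receive the average of the ranks they span and that the constant C has already been obtained for the comparison; the function names are hypothetical and this is not the program's own code.

```python
def mean_ranks(samples):
    """Rank all observations jointly (average ranks for ties) and
    return the mean rank of each sample."""
    pooled = sorted(v for sample in samples for v in sample)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j share the average rank (i + 1 + j) / 2.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return [sum(rank_of[v] for v in sample) / len(sample) for sample in samples]

def nemenyi_decision(rbar_j, rbar_jp, c):
    """Return d = |R_j - R_j'| and whether it exceeds the constant C."""
    d = abs(rbar_j - rbar_jp)
    return d, d > c
```

With three samples `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` the mean ranks are 2, 5, and 8, so the comparison of the first and third samples has d = 6, which would be declared significant for any C below 6.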


Dunn (1964) presented a procedure whereby rank sums from a
combined ranking of k independent samples (Kruskal-Wallis model)
are used to determine which populations differ. This procedure re-
quires the calculation of the statistic y/σ for each comparison. Each
of these values is then compared with tabled values for the
standard normal distribution to determine approximate probability
levels.


The components of the statistic y/σ are calculated by the following
formulas:

where y is an arbitrary contrast, T is the rank sum for a given
sample, n is the number of subjects in a given sample, and j and j′
are as defined above;


where σ is the standard deviation of an arbitrary contrast and N,
r, s, t, n, j, and j′ are as defined above.

Rosenthal and Ferguson (1965) described a procedure which can
be employed to construct post hoc confidence intervals for experi-
ments involving n rankings of k objects (Friedman model). This 
procedure requires the calculation of the mean and standard error 
of the mean for each contrast, and a constant for all contrasts. 


The mean, $\bar{T}$, and standard error of the mean, $s_{\bar{T}}$, for each contrast
are calculated by the following formulas:

\[ \bar{T} = \frac{\sum_{i=1}^{n} T_i}{n}, \qquad s_{\bar{T}} = \sqrt{\frac{\sum_{i=1}^{n} T_i^2 - \left(\sum_{i=1}^{n} T_i\right)^2 / n}{n(n-1)}} \]

where $T_i$ is the weighted sum of the ranks for a given subject i
(or group of matched subjects) and n is the number of subjects (or
matched subjects).
The constant, C, for all contrasts is calculated as follows: 


\[ C = \frac{(k-1)(n-1)}{n-k+1}\,F_{\alpha} \]

where k is the number of experimental conditions, n is the number
of subjects (or matched subjects), F is the usual statistic, and α =
.05.


The confidence intervals for the various contrasts are of the fol- 
lowing form: 


\[ \bar{T} - \sqrt{C}\,s_{\bar{T}} \le L \le \bar{T} + \sqrt{C}\,s_{\bar{T}} \]
where L is an arbitrary contrast and $\bar{T}$, C, and $s_{\bar{T}}$ are as defined above.
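As an illustration of the interval construction, the sketch below computes the weighted rank sum for each subject, its mean and standard error, and the resulting limits. It assumes the standard error of the mean is the usual sample standard deviation divided by the square root of n, and that the constant takes the form (k − 1)(n − 1)F/(n − k + 1) with F supplied by the user from a table; the function names are hypothetical.

```python
import math

def rf_constant(k, n, f_value):
    """Constant C for all contrasts, given a tabled F value (assumed form)."""
    return (k - 1) * (n - 1) / (n - k + 1) * f_value

def contrast_interval(weights, rankings, c):
    """rankings: one list of k ranks per subject; weights: k contrast weights.
    Returns the lower and upper limits of the post hoc interval."""
    n = len(rankings)
    # T_i: weighted sum of the ranks for subject i
    t = [sum(w * r for w, r in zip(weights, ranks)) for ranks in rankings]
    t_bar = sum(t) / n
    ss = sum((ti - t_bar) ** 2 for ti in t)
    se = math.sqrt(ss / (n * (n - 1)))  # standard error of the mean
    half_width = math.sqrt(c) * se
    return t_bar - half_width, t_bar + half_width
```

An interval that excludes zero would indicate that the contrast differs reliably from zero at the chosen level.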


Input 


The job deck set-up for each analysis is as follows: 
Problem card

Columns 1     = Nonparametric test (1 = Kruskal-Wallis; 2 =
                Friedman)
        2     = Nemenyi's test (1 = yes; 0 = no)
        3     = Dunn's test (1 = yes; 0 = no)
        4     = Rosenthal and Ferguson test (1 = yes; 0 = no)
        5-6   = Number of samples or experimental conditions
                (k)
        7     = All possible pairwise comparisons (1 = yes;
                0 = no)
        8-10  = If column 7 is 0, then the number of comparisons
                is punched in these columns; otherwise, they are
                left blank.
        11-16 = If column 1 is 2, and column 4 is 1, then the
                F-ratio for the Rosenthal and Ferguson test
                (see above) is punched in these columns (Note:
                the decimal must be punched); otherwise, they
                are left blank.
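The column layout of the problem card can be read with simple fixed-column slicing. The following is a sketch of such a reader, not the program's own input routine; the dictionary keys are hypothetical names chosen for illustration.

```python
def parse_problem_card(card):
    """Decode the 16 columns of a problem card (columns are 1-based)."""
    card = card.ljust(16)

    def field(first, last):
        return card[first - 1:last].strip()

    all_pairs = field(7, 7) == "1"
    f_text = field(11, 16)
    return {
        "test": {"1": "Kruskal-Wallis", "2": "Friedman"}[field(1, 1)],
        "nemenyi": field(2, 2) == "1",
        "dunn": field(3, 3) == "1",
        "rosenthal_ferguson": field(4, 4) == "1",
        "k": int(field(5, 6)),
        "all_pairwise": all_pairs,
        # Columns 8-10 are read only when column 7 is 0.
        "n_comparisons": None if all_pairs else int(field(8, 10)),
        # Columns 11-16 hold the F-ratio with its decimal point punched.
        "f_ratio": float(f_text) if f_text else None,
    }
```

For example, the card `2101 40  3  4.26` would describe a Friedman analysis of four conditions with Nemenyi's test and the Rosenthal and Ferguson test requested, three user-specified comparisons, and an F-ratio of 4.26.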
Contrasts matrix format card


This F-type variable format card describes each row of the arbi- 
trary contrasts matrix. This format may be punched in any of the 
columns on the card. If column 7 on the problem card is 1, then this 
card is omitted. 

Arbitrary contrasts matrix


This matrix is entered one row at a time. Each row must
begin on a new card and must have k weights indicating the con-
trast to be made. These cards must be punched in accordance with 
the F-type contrasts matrix format card (see above). If column 7 
on the problem card is 1, then these cards are omitted. 


Data format card 


This F-type variable format card indicates the location of the
raw scores (or ranks) on the data cards. This format may be punched
in any of the columns on the card.


Sample Card(s) 


A card (or cards) indicating the size(s) of the sample(s). For the
Kruskal-Wallis test, the number of subjects in each sample is punched
on the card(s) using 26I3 format. For the Friedman test, the num-
ber of subjects in the sample (or matched samples) is punched on
the card using I3 format.


Data deck


These cards contain the data for each sample (or experimental
condition) and must be punched in accordance with the format 
specified on the F-type data format card (see above). For the 
Kruskal-Wallis test, the data are punched by sample with the data 
for each sample beginning on a new card. For the Friedman test, the 
data are punched by subject (or group of matched subjects) with 
the data for each subject (or group of matched subjects) begin- 
ning on a new card. 


Last card 


If the user wishes to terminate the program, then the card im- 
mediately following the data deck must have the word FINISH 
punched in columns 1 to 6. However, if the user wishes to analyze 
another set of data, then this card is a blank card and the job deck 
is arranged sequentially (as described above) beginning with the 
problem card. 


Output 

The computer output for the Kruskal-Wallis test includes the 
value of H (corrected for tied ranks), the number of degrees of 
freedom, and the average rank for each sample. Moreover, if the null 
hypothesis is rejected, then the output for Nemenyi’s test includes 
the values of d and C for each contrast, and the output for Dunn’s 
test consists of the value of each contrast, the standard deviation of 
each contrast, and the value of the statistic y/σ.

The computer output for the Friedman test includes the value of 
χ² (corrected for tied ranks), the number of degrees of freedom,
and the average rank for each experimental condition. Furthermore, 
if the null hypothesis is rejected, then the output for Nemenyi's test 
is as described above, and the output for the Rosenthal and Ferguson 
test consists of the 95 percent confidence interval for each contrast. 


Capabilities and Limitations 
The program, which is written in FORTRAN IV, can handle
a maximum of 30 samples (or experimental conditions) and 200
subjects per sample (or experimental condition). Jobs may be run
sequentially as described above.


Availability 


Copies of this paper and a source listing which includes input 
and output data for sample problems can be obtained by writing 
to Dr. James J. Roberge, Temple University, Department of Educa- 
tional Psychology, Philadelphia, Pennsylvania 19122. 
A PROGRAM OF SCHEFFÉ'S METHOD


RANDALL M. PARKER 
The University of Texas at Austin 


SCHEFFÉ (1959) has reported a method for making any or all
possible comparisons among means or combinations of means in
equal or unequal n ANOVA designs, regardless of whether significant
F values are obtained. Because of its flexibility, Scheffé's method is
a particularly desirable multiple comparison procedure.

SCHEFFÉ is a FORTRAN program that computes Scheffé's
method for up to 100 means or combinations of means. A unique
feature of this program is that it computes a corrected F value
and the exact probability of F, which is analogous to the usual
procedure of computing confidence intervals, but somewhat easier
to interpret. The input parameters are the number of group means
to be compared, the number of subjects, the error term (usually
MS within), the tabled F value at appropriate degrees of freedom
and desired probability level (optional), the group means, the
ns for each group, and the coefficients for each comparison. Output
includes the arithmetic difference among the means, the smallest
statistically significant difference among means (only if the F value
parameter is input), a corrected F value, and the exact probability
of the F value for each comparison. Subroutines PRTS and PRBF
(Veldman, 1967) are called by SCHEFFÉ.


REFERENCES

Scheffé, H. A. The analysis of variance. New York: Wiley, 1959.
Veldman, D. J. Fortran programming for the behavioral sciences.
New York: Holt, Rinehart, and Winston, 1967.
A COMPUTER PROGRAM FOR THE COMPILATION
OF DATA FROM CLASSROOM OBSERVATION
SYSTEMS HAVING MUTUALLY EXCLUSIVE
CATEGORIES¹


THOMAS B. GREGORY²
Indiana University 


THE past decade has witnessed the development of a profusion
of observation instruments that attempt to record objectively the
ongoing verbal and/or nonverbal events occurring in the class-
room. Flanders' Interaction Analysis (1960) and the several re-
visions of the OScAR (e.g., Medley and Mitzel, 1958; Medley,
Schluck, and Ames, 1968) are notable examples of this trend.
Simon and Boyer (1967) reported that over 50 such systems had
been developed by 1967. Even modest extrapolations of the ac-


which can, in turn, lead him to attempt alternative tactics per-
ceived as being more desirable. Second, such observation systems 
can be valuable criterion measures in research applications. 

Many computer programs exist which compile the data required 
to fulfill either or both of these functions for individual observa- 


¹This research was conducted with partial support of USOE Grant No. OE
6-10-108, Research and Development Center for Teacher Education, The Uni-
versity of Texas at Austin.
²The author wishes to acknowledge the assistance of Dr. Donald J. Veldman
in the preparation of a preliminary form of this program.
tion systems. However, two problems arise from such an approach.
First, many systems are in an almost constant state of evolution.
A computer program written specifically for such a system must be
rewritten as its system changes. Second, many systems are devised 
for a specific circumstance which may never occur again. Develop- 
ing programs for such unique situations becomes prohibitive be- 
cause of the time factor involved. 

Program Cosan (Classroom Observation System ANalysis) is
a general purpose FORTRAN program that can compile data
needed for fulfilling either or both the feedback and/or research
functions for any observation system containing up to 25 mutually
exclusive categories. A series of subroutines allows a wide range of
input and output options that facilitate adaptation of the program 
to diverse feedback and/or research contexts. 


Data Deck Arrangement 


Card 1 Parameter Control Card 
Columns: 


1-2 Number of categories in the system (Max., 25). 

3-4 Number of ratios to be specified (Max., 20). A 0 
suppresses this function. 

5-6 Number of behavior sequences to be punched 
(Max., 20). A 0 suppresses this function. 

7-8 Percentage desired for minimum cut-off point on
high-frequency cell listings (e.g., a 3 will cause
only those cells containing at least 3% of the total
behavior to be listed). A 0 suppresses this listing.

9-10 A 1 if any printed output is desired; 0 otherwise. 

11-12 A 1 if any punched output is desired; 0 otherwise. 


The following 4 parameters may be left blank if 
cols. 9-10 = 0. 


13-14 A 1 if printed matrix is desired, 0 to suppress it. 

15-16 A 1 if printed column (category) totals are desired,
0 to suppress them.

17-18 A 1 if printed percentages of total behavior for each 
category are desired, 0 to suppress them. 

19-20 A 1 if printed ratios are desired, 0 to suppress them.


The following 3 parameters may be left blank if
cols. 11-12 = 0.

21-22 A 1 if punched column (category) totals are desired,
0 to suppress them.

23-24 A 1 if punched percentages of the total for each
category are desired, 0 to suppress them.

25-26 A 1 if punched ratios are desired, 0 to suppress
them.

Punched data are returned 10 variables per card.
Each card contains the subject's identification field
(cols. 1-8), a data identification code (col. 9, for
which T = column totals, P = percentages of total
behavior for each category, R = ratios, and S =
sequences), and a data card number (col. 10).
For example, a punched card displaying R2 in
cols. 9-10 indicates that it is the second ratio card
for the subject identified in cols. 1-8.


Graphic Display Spacing Specifications

The following 7 parameters allow the optional
graphic display of the sequence of behavior to
reflect logical groupings of categories within the
system. For example, Flanders' original 10 category
system contains 4 logical groups: indirect influence
(4 categories), direct influence (3 categories),
student talk (2 categories), and silence (1 category).
Therefore, 4, 3, 2, and 1 would be the first 4 param-
eters, which would be followed by 3 parameters
left blank.

27-28 Number of categories desired in first group. Set
equal to category N if no grouping is desired. Set
equal to 0 to suppress the graphic display.


The following six parameters may be left blank if
cols. 27-28 equal either the category N or 0.

29-30 Number of categories desired in a 2nd group if
needed, blank otherwise.

31-32 Number of categories desired in a 3rd group if
needed, blank otherwise.

33-34 Number of categories desired in a 4th group if
needed, blank otherwise.

35-36 Number of categories desired in a 5th group if
needed, blank otherwise.

37-38 Number of categories desired in a 6th group if
needed, blank otherwise.

39-40 Number of categories desired in a 7th group if
needed, blank otherwise.


Card 2 Category Code Card
Columns:

1-1 FORTRAN character on data cards corresponding
to category 1.

2-2 FORTRAN character on data cards corresponding
to category 2.

N-N FORTRAN character on data cards corresponding
to category N.

N + 1-N + 1 FORTRAN character on data cards used as a stop
signal indicating the end of a subject's data.

The ordering of the output display of categories
may be altered by simply arranging the order of
their corresponding FORTRAN characters on this
card. A / cannot be used as a category symbol.


Next N_r Cards Ratio Specification Card(s) (N_r = Parameter
Control Card, cols. 3-4). Omit if N_r = 0.
Columns:


1-20 Ratio identification. Any legal FORTRAN
characters may be used.

21-80 Ratio specification. Categories should be identified
by the same symbols used on data cards. Addition
is assumed between categories except when the
beginning of the denominator is signaled through
use of a /. One / is permitted, though not mandatory,
in each “ratio.” A category may be "weighted"
by repeating it a desired number of times in the
numerator and/or denominator. No blanks are
permitted in this field until the end of the ratio.
The total of all categories may be indicated by
using the stop signal (see Category Code Card).
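The ratio specification rules can be illustrated with a short evaluator. This is a sketch under the rules stated above (addition within numerator and denominator, an optional single /, weighting by repetition, the stop signal standing for the total of all categories, and the first blank ending the field); the function and argument names are hypothetical and the code is not taken from Cosan.

```python
def evaluate_ratio(spec, totals, stop_signal):
    """Evaluate one ratio specification.

    spec        -- the text of cols. 21-80 of a ratio specification card
    totals      -- dict mapping each category symbol to its column total
    stop_signal -- the symbol that also stands for the total of all categories
    """
    spec = spec.split(" ")[0]  # the first blank ends the ratio
    grand_total = sum(totals.values())

    def side_sum(symbols):
        # Repeating a symbol adds its total again, which "weights" it.
        return sum(grand_total if s == stop_signal else totals[s] for s in symbols)

    numerator, slash, denominator = spec.partition("/")
    value = side_sum(numerator)
    if slash:
        value = value / side_sum(denominator)
    return value
```

With category totals of 10, 20, 30, and 40 for the symbols 1 through 4 and `*` as the stop signal, the specification `12/34` gives 30/70 and `112/*` gives 40/100.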


Next Card Sequence Specifications Card
Omit if Parameter Control Card, cols. 5-6 = 0.


Columns:
1-2 Matrix row number containing cell of first behavior
sequence to be punched (e.g., 4 for the 4, 10 cell).
3-4 Matrix column number containing same cell (e.g.,
10 for the 4, 10 cell).
5-80 Specification of up to 19 additional behavior
sequences (matrix cells) using same format.


Data are returned as the percentage of total
behavior located in each specified cell.
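A sketch of how such cell percentages might be obtained follows. It assumes that the matrix tallies each pair of consecutive behaviors, with the earlier behavior as the row and the later one as the column, which is the convention of Flanders-type interaction matrices; the text does not state that Cosan tallies in exactly this way, and the function names are hypothetical.

```python
def transition_matrix(behaviors, n_categories):
    """Tally consecutive pairs of category numbers (1-based) into a matrix."""
    matrix = [[0] * n_categories for _ in range(n_categories)]
    for previous, following in zip(behaviors, behaviors[1:]):
        matrix[previous - 1][following - 1] += 1
    return matrix

def cell_percentage(matrix, row, column):
    """Percentage of all tallies that fall in the given cell (1-based)."""
    total = sum(sum(r) for r in matrix)
    return 100.0 * matrix[row - 1][column - 1] / total

def high_frequency_cells(matrix, cutoff_percent):
    """List (row, column, percentage) for cells at or above the cut-off."""
    n = len(matrix)
    cells = [(r, c, cell_percentage(matrix, r, c))
             for r in range(1, n + 1) for c in range(1, n + 1)]
    return [cell for cell in cells if cell[2] >= cutoff_percent]
```

For the behavior string 1, 2, 2, 3, 1, 2 in a three-category system, the 1, 2 cell holds two of the five tallies, or 40 per cent of the total.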


Next Card Beginning of Subjects' Data


Columns:

1-8 Subject identification. Any legal FORTRAN
characters may be used. This field must not be
blank.

9-10 This field is not read by the machine and is for
clerical uses such as numbering the data cards for
each subject.

11-80 First 70 behaviors (category entries) for this subject.


Repeat same format for second through Nth cards
for subject 1. Any number of behaviors (max. =
1000; i.e., 14 cards plus 20 cols. of card 15) are
permitted for each subject. When data for subject 1
are completed, the stop signal should appear in
the next data column. Begin subject 2's data on
next card.

Last Card Blank card indicating the end of the data.

Listings of Program Cosan may be obtained either from the
Research and Development Center for Teacher Education at The
University of Texas at Austin, Austin, Texas 78712; or from the
author, 125 School of Education Building, Indiana University,
Bloomington, Indiana 47401.
AN ITEM ANALYSIS AND SCORING PROGRAM
FOR SUMMATED RATING SCALES¹


RICHARD L. KOHR 
Bucknell University 


THIS program has three major options for processing data origi-
nating from an attitude scale of the Likert-type (summated rating 
scale). Option one produces an item analysis following procedures 
similar to those outlined by Edwards (1957, pp. 152-154). Option 
two yields a printout and/or punch cards containing the results 
of various scoring operations. Option three indicates that the user 
desires both options. 

Input to the program are item scores and any desired coded in-
formation (e.g., demographic variables) which the user wishes to
have reproduced, along with the scale score(s), on the punch card
output.

Regardless of the options selected, certain information is sup- 
plied to the user. This includes a page summarizing the options 
requested and the information contained on the program control 
cards. A second page gives the following total scale information: 
total number of items, number of response choices or categories, 
sample size, mean, variance, standard deviation, unbiased estimate 
of population variance, estimated standard deviation, Coefficient 
Alpha, standard error of measurement, and the estimated average 
inter-item correlation.
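Two of the total scale statistics can be sketched as follows. The sketch uses the standard definition of Coefficient Alpha and takes the standard error of measurement as the standard deviation of total scores multiplied by the square root of one minus alpha; whether the program uses the divisor n or n − 1 for that standard deviation is not stated, so the divisor n used here is an assumption. The function names are hypothetical.

```python
import math
from statistics import pvariance

def coefficient_alpha(item_scores):
    """item_scores: one list of k item scores per respondent."""
    k = len(item_scores[0])
    totals = [sum(respondent) for respondent in item_scores]
    item_variances = [pvariance([r[i] for r in item_scores]) for i in range(k)]
    return k / (k - 1) * (1 - sum(item_variances) / pvariance(totals))

def standard_error_of_measurement(item_scores):
    """Total score standard deviation times sqrt(1 - alpha)."""
    totals = [sum(respondent) for respondent in item_scores]
    alpha = coefficient_alpha(item_scores)
    # max() guards against a tiny negative value from rounding when alpha = 1.
    return math.sqrt(pvariance(totals) * max(0.0, 1 - alpha))
```

For the two-item responses `[[1, 2], [2, 1], [3, 3], [4, 4]]` each item has variance 1.25 and the total score has variance 4.5, so alpha is 2(1 − 2.5/4.5) = 8/9.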

The item analysis option provides the following. For each item 
a table is printed for both a low total attitude score group (lowest 
27%) and a high total attitude score group (highest 27%). The 
tables include the frequency and proportion of occurrence of each 


¹The development of this program for the IBM System 360/67 was sup-
ported by the Pennsylvania State University Computation Center.
response choice. The item mean and standard deviation are pre-
sented for both contrast groups. In addition, the output includes
an adjusted correlation between each item and the score based on 
the composite of the remaining items on the scale. This represents 
the correlation between an item and the total score with the contri- 
bution of that item to the total score removed (Guilford, 1953). 
This correlation is based on all the respondents and not just those 
in the high/low contrast groups. Lastly, a frequency distribution 
of the total attitude scores is printed along with the N, mean, 
variance, and standard deviation of the two contrast groups. 
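The adjusted item-total correlation can be illustrated by computing it directly, that is, by correlating each item with the sum of the remaining items. The program is described as following Guilford (1953), who gives a correction formula applied to the ordinary item-total correlation; the direct computation below is offered only as an equivalent sketch, with hypothetical function names.

```python
import math

def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_remainder_correlations(item_scores):
    """item_scores: one list of k item scores per respondent.
    Returns, for each item, its correlation with the total of the other items."""
    k = len(item_scores[0])
    totals = [sum(respondent) for respondent in item_scores]
    result = []
    for i in range(k):
        item = [respondent[i] for respondent in item_scores]
        remainder = [t - v for t, v in zip(totals, item)]
        result.append(pearson(item, remainder))
    return result
```

All respondents enter the computation, in keeping with the statement that this correlation is not restricted to the high and low contrast groups.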

The scores on which the various statistics are based are the values 
assigned to the response choices which must form a meaningful 
continuum. Often this takes the form of strongly agree, agree, . . . ,
strongly disagree. Since attitude scales frequently contain some 
statements which are favorable toward the object and some un- 
favorable, it is necessary to reverse the scoring of certain items. 
A conventional scoring system assigns the higher value to the
response indicating favorableness such that when the item scores
are summed they are directionally consistent. The program can 
perform this reversing operation and thereby can permit considerable 
flexibility as to the nature of the input data. 
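The reversing operation amounts to reflecting a response about the midpoint of the response continuum. A minimal sketch, assuming response choices are coded as consecutive integers beginning with 1 and that items are indexed from 0; the function names are hypothetical.

```python
def reverse_score(score, n_choices, lowest=1):
    """Reflect a response so that the lowest value becomes the highest."""
    highest = lowest + n_choices - 1
    return lowest + highest - score

def total_score(responses, reversed_items, n_choices=5):
    """Sum item scores after reversing the items listed in reversed_items."""
    return sum(
        reverse_score(r, n_choices) if i in reversed_items else r
        for i, r in enumerate(responses)
    )
```

On a five-choice scale a response of 2 to an unfavorably worded item becomes 4, so that higher totals consistently indicate a more favorable attitude.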

The second option may consist of printed and/or punch card 
output consisting of each respondent's identifying information (e.g.,
subject and demographic codes) and total (or subscale) score(s). It
is also possible to receive punch cards containing scored items (made
directionally consistent) as well as all items pertaining to a sub-
scale clustered together.

The program, written entirely in FORTRAN IV (H-level), con-
tains no machine-specific subroutines. Write-ups, listings of the
source program and trial data, and a copy of the output may be 
obtained by writing to the author. 
THE USE OF THE COMMON/DATA STATEMENT TO
DETERMINE THE TYPE OF AN EVENT IN 
SIMULATION STUDIES 


EDWIN L. ANDERSON 
Oregon State University 


THE determination of times between events in simulation studies
based on stochastic processes involves the use of a random number
generator (RNG), an expected value for the time of the event
which is usually derived from a real system, and the inverse
transformation of the cumulative distribution function of some
known distribution. When this routine indicates that it is time for a
decision to be made concerning an event, a simulation program
written in FORTRAN can accomplish this (when the probability
of such an event occurring is known) by IF statements and the
RNG. For example, if 80 per cent of the students entering a high
school counseling center desire to have a conference with a counselor,
the RNG can be called to generate random variates uniformly
distributed between 0 and 1. If the random variate is .80 or less,
the program will branch to a sequence which determines the type
of conference and the length of the conference. If the random
variate is greater than .80, the program will branch to a sequence
which determines the purpose for the student being in the system
and the length of time spent in the system.


Input 
If it has been decided that the next student entering the system 
desires a conference, IF statements may be used to determine the 
type of conference. However, when the number of possibilities is 
great, this routine becomes somewhat lengthy, and the COMMON/ 
DATA statement in the computer program handles the decision 
more efficiently. If the conference types and the probabilities from
the real system are identified as listed in Table 1, the COMMON/
DATA statement would be:


      COMMON/DATA/TYPE(100)
      DATA ((TYPE(I),I=1,100)=10(1),2(2),13(3),
      18(4),10(5),8(6),33(7),3(8),3(9))
The COMMON/DATA statement can be thought of as the
cumulative frequency table displayed in Table 2.
The RNG, supplied with an initial random number, will generate 
a random variate uniformly distributed between 1 and 100. The 
cumulative table is searched to locate the position of that variate 
and the corresponding number representing a type of conference is 
determined. This is the type of conference for one student. The 
process would then be repeated for each student entering the system 
and desiring to have a conference. It is not necessary to input 
additional random numbers, as the next random variate to be used 
will be generated by the RNG. 
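The table look-up can be sketched outside FORTRAN as follows. The array mirrors the repeat counts of the DATA statement (10 entries of type 1, 2 of type 2, and so on), and Python's own random number generator stands in for the RNG of the original program, so the sequence of variates for a given seed will not match the runs reported here. The names are hypothetical.

```python
import random

# Repeat counts per conference type, as in the DATA statement.
COUNTS = {1: 10, 2: 2, 3: 13, 4: 18, 5: 10, 6: 8, 7: 33, 8: 3, 9: 3}

# TYPE holds 100 entries: positions 1-10 are type 1, 11-12 are type 2, etc.
TYPE = [kind for kind, count in COUNTS.items() for _ in range(count)]

def conference_type(variate):
    """Look up the conference type for an integer variate from 1 to 100."""
    return TYPE[variate - 1]

def simulate(n_students, seed):
    """Return the proportion of each conference type among n_students."""
    rng = random.Random(seed)
    tally = {kind: 0 for kind in COUNTS}
    for _ in range(n_students):
        tally[conference_type(rng.randint(1, 100))] += 1
    return {kind: count / n_students for kind, count in tally.items()}
```

A single array reference thus replaces the chain of IF statements that would otherwise be needed to separate nine types of conference.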


Output 


In simulation studies dependent on stochastic processes, the 
ordinary procedure is to make several computer runs and to use the 
average of the runs for the simulated system. The data presented 
in Table 3 are a result of three runs using the COMMON/DATA 
statement given above. 

If the mean probabilities presented in Table 3 simulate the type
of conferences held during one day at the counseling center, a
simulation of several days will result in probabilities which will be a
“good fit” to the actual probabilities in Table 1. Thus, the
COMMON/DATA statement may be a vital aspect of a simulation
program.


TABLE 1
Types of Conferences* and Probability of Occurrence

Type  Conference         Probability
1     Academic           .10
2     Attendance         .02
3     Employment         .13
4     Personal-Social    .18
5     Post High School   .10
6     Records            .08
7     Schedule           .33
8     Vocational         .03
9     Other              .03

* Types of conferences established by six counselors at a high school
counseling center. Probabilities were derived by tracing 2,127 students
through the counseling center during nine random sampling days during
the winter months of 1969-70.


TABLE 2
Cumulative Frequency Table from COMMON/DATA Statement

Random variate   Conference type
98-100           9
95-97            8
62-94            7
54-61            6
44-53            5
26-43            4
13-25            3
11-12            2
1-10             1
TABLE 3
Probability of Type of Conference Using a RNG and COMMON/DATA Statement:
100 Counselees

               Run 1        Run 2        Run 3        Summary statistics for all runs
(Initial
random no.)    (27935)      (6761)       (69831)

Conference
type           Probability  Probability  Probability  Mean    SD      SE
1              .10          .09          .11          .1000   .0100   .0058
2              .01          .03          .03          .0233   .0115   .0067
3              .15          .13          .18          .1533   .0252   .0145
4              .18          .17          .14          .1600   .0265   .0153
5              .08          .12          .10          .1000   .0200   .0115
6              .07          .06          .05          .0600   .0100   .0058
7              .37          .33          .34          .3467   .0208   .0120
8              .01          .03          .04          .0267   .0153   .0088
9              .03          .04          .01          .0267   .0153   .0088
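The summary statistics in Table 3 appear to be the mean, the sample standard deviation (divisor n − 1), and the standard error (standard deviation divided by the square root of the number of runs). A minimal sketch under that assumption, with a hypothetical function name:

```python
import math
from statistics import mean, stdev

def summarize_runs(run_probabilities):
    """Return (mean, SD, SE) of the probabilities obtained over several runs."""
    m = mean(run_probabilities)
    sd = stdev(run_probabilities)  # sample standard deviation, divisor n - 1
    se = sd / math.sqrt(len(run_probabilities))
    return m, sd, se
```

Applied to the three runs for conference type 1 (.10, .09, .11), this yields .1000, .0100, and .0058 when rounded to four places.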
AN EDP SYSTEM PACKAGE
FOR SCORING THE INTERPERSONAL CHECK LIST 


DONALD E. LANGE 
University of Victoria 


ALTHOUGH the Interpersonal Check List (ICL) (LaForge and
Suczek, 1955) has found a wide and varied usage, many potential
users are reluctant to include it in their assessment battery because
of the laborious task involved in keying for all twenty variables
and the subsequent computation necessary to obtain its summary
scores. To correct this situation, an EDP system package has
been prepared which will relieve the ICL user from such clerical
labor. The package is offered free of charge to anyone wishing a
copy, provided that it is not used for commercial gain (e.g., using
the package to set up a center to score ICLs for profit).


Description of the System


The system has been designed to relieve the user from the clerical
tasks involved in keying and scoring the ICL form IV (LaForge,
1963). The computer program itself consists of a main driver and
two subroutines; it is written in basic FORTRAN IV. Currently,
tested packages are available for IBM System 360 OS/DOS, IBM
1130, and PDP 10. If the user wishes to employ the package on
other systems, it is a simple matter to change job control language
in order to make it compatible, provided, of course, that the other
system will support standard FORTRAN IV.

Data input, which may be of two types, may include up to eight 
response protocols from a single subject. First, the preferred type 
is from the Document No. 511 standard form IBM 1230 Optical 
Mark Scoring answer sheets, upon which the testee has marked his 
responses to the ICL. This standard form is then read on the IBM
1230, which transfers the marks, in special 1230 code, to punch
cards that will be later keyed and scored by computer. Second, if
the user has already hand-keyed the ICL for the necessary twenty
variables, the variable data may be entered for computation of the
summary scores. This step, of course, does not utilize the package's
ability to key the ICL and to save the user from such an uninterest-
ing task, but it will save him from the lengthy job of computing the
scores. Therefore, if someone is planning to administer
the ICL, it is recommended that he change his answer sheets to 
Document No. 511 standard form IBM 1230 and ask his examinees 
to mark the first answer choice for each ICL adjective that applies 
(leaving it blank if it does not). He should let the item numbers 
on the answer sheet correspond to the ICL's adjective num- 
bers. Then on receipt of the system package, these protocols may 
be read on the IBM 1230, keyed, and scored by computer. 

Since the output may be either printed only or both printed and
punched, the user can use the resulting ICL data for further
analyses without an additional keypunching step.

Materials 

The materials which will be returned to those who request the ICL 
Scoring Package will consist of a control sheet and drum card for
the IBM 1230 Optical Mark Scoring Reader, the computer program 
source decks, and a detailed manual on how to utilize the scoring 
system. When ordering, one should state the configuration of the 
computer system so that the most nearly applicable package may 
be returned. 

Please send request for this package to Donald E. Lange, Depart- 
ment of Psychology, University of Victoria, Victoria, British Co- 
lumbia, Canada. 

MERMAC TEST AND QUESTIONNAIRE
ANALYSIS SYSTEM 


LAWRENCE M. ALEAMONI 
University of Illinois 


A test and questionnaire analysis system was designed to aid
instructors in developing valid and reliable tests and to provide
rapid and meaningful feedback to the instructor and students.


Description of the System 


copy, edit, match, merge, sequence, sort, and recode the input data.
Generally, the purpose of these programs is to prepare the data for
input to the test and questionnaire analysis programs. The six test
and questionnaire analysis programs allow the user to:

1. Score item data and produce up to forty subscores for each 
individual. Each item and response may be weighted to arrive
at the scores. Any item may be included in more than one 
subscore and be weighted differently in each. 

2. Take scores for a group of individuals and produce a fre- 
quency distribution and histogram, mean, median, standard 
deviation, Kuder-Richardson reliability, standard error of 
measurement, and Spearman-Brown 
of 90. In addition, individual raw scores, standard scores, and 
percentiles may be listed. Individual raw scores and standard 
scores can be weighted and summed, and the sum assigned a letter
grade. All these data can be easily provided to the student.

3. Return to each student a page containing his test score and 
a list of the items he missed with his responses and the correct
responses. 

4. Analyze his item data by providing a plot of the percentage 
of individuals responding to the keyed response by fifths of 
the total score distribution. For each item alternative the 
proportion of individuals responding, a point biserial correla- 
tion, and the number responding to each alternative by fifths 
is provided. 

5. Analyze his item data by using some external criterion rather 
than the keyed test score. 

6. Summarize item data from questionnaires or tests with no 
known correct answers by providing a frequency distribution of 
responses, a weighted mean, and a standard deviation for each 
item. In addition, subscores may be generated with means, 
standard deviation, split-half reliabilities, and percentage of 
individuals responding to the contributing items. It is also 
possible to assign deciles to the item and subscore means 
based on a table look-up. 
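
As a rough illustration of the kind of statistics named in items 2 and 4 above, the following Python sketch computes item difficulty, a point-biserial correlation for the keyed response, and a Kuder-Richardson reliability. It is not the MERMAC code (which is written in BAL). The function name `item_analysis`, the 0/1 scoring, the choice of formula KR-20, and the use of N rather than N - 1 in the variance are all assumptions made for the example.

```python
from math import sqrt


def item_analysis(responses):
    """Hypothetical helper: responses is a list of examinees, each a list
    of 0/1 item scores (1 = keyed response chosen)."""
    n = len(responses)                      # number of examinees
    k = len(responses[0])                   # number of items
    totals = [sum(r) for r in responses]    # total score per examinee
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # divisor N (assumption)
    if var_t == 0:
        raise ValueError("total scores have zero variance")
    sd_t = sqrt(var_t)

    items = []      # (proportion passing, point-biserial) per item
    sum_pq = 0.0
    for j in range(k):
        col = [r[j] for r in responses]
        p = sum(col) / n
        q = 1 - p
        sum_pq += p * q
        if 0 < p < 1:
            # mean total score of those who chose the keyed response
            mean_pass = sum(t for t, x in zip(totals, col) if x) / sum(col)
            r_pb = (mean_pass - mean_t) / sd_t * sqrt(p / q)
        else:
            r_pb = float("nan")             # undefined when everyone agrees
        items.append((p, r_pb))

    # Kuder-Richardson formula 20 (assumed; the notice does not say which)
    kr20 = (k / (k - 1)) * (1 - sum_pq / var_t)
    return items, kr20


# Five examinees, four items, arranged as a perfect Guttman pattern.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
items, kr20 = item_analysis(data)
```

The point-biserial here correlates each item with the total score that includes the item itself; a production system might well exclude the item from the criterion.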


Summary 
The MERMAC system is written in Basic Assembly Language 
(BAL) for both IBM System/360 models 40 and above and
IBM System/370 models 135 and above which have Operating
System (OS) with Queued Sequential Access Method (QSAM) 
support. 

Additional information about the program and its availability 
may be obtained from Lawrence M. Aleamoni, Measurement and 
Research Division, 307 Engineering Hall, University of Illinois, 
Urbana, Illinois 61801. 

John W. Best. Research in Education. (2nd ed.) Englewood Cliffs,
N. J.: Prentice-Hall, 1970. Pp. vi + 300. 18.95.


It has been 11 years since Best's first edition of Research in
Education and, judging from the extent of the revision, it
doesn't look as though much has happened during this time span.
To be fair, however, I should preface further remarks by saying
that the book appears to be a good one and will meet the reasonable
objectives set by Best. He says that the book won't make one
an expert in research, and that is refreshing to hear. As Traub
(1969) has discussed, facilitating the process of making better
consumers of research, as compared to making a researcher, is a
quite different calling.

But this strays from the immediate mission. What follows is a
brief run-down of the contents of this second edition with
what I hope are relevant comments interjected along the way.

Best covers 13 chapters and four appendices in approximately
400 pages. The first three chapters deal with such questions as what
research is, identification of a research problem (which includes
almost five pages of suggested topical areas for research), and the
use of reference materials. The chapter on reference materials,
while possibly useful, takes up about 50 pages by listing far too
many sources of information that students could go to.

Best goes over the Dewey decimal system 
use of the library. While this material 
leads that graduate students can follow up, (a) the necessity for
the extensiveness of the list of reference material is highly
debatable and (b) the inclusion of “how to use the library” information
seems inappropriate for a text designed primarily to cater to
graduate students. True, some graduate students might still not
know how to use the library; however, these people must be




examples and appropriate limitations. The most notable difference 
between the revised and earlier edition is the reworking of the
chapter on experimental methods. Best brings in, summarizes, and
discusses quite effectively notions from Campbell and Stanley (1966).
Matching studies are downplayed this go-round as compared to
the '59 version. To me, this chapter is the nicest aspect of his
current reworking of the book.

Chapter 7 discusses various tools of research such as question- 
naires, surveys, psychological tests, interviews, etc. However, in- 
cluded is a fairly inadequate presentation of the concepts of re- 
liability and validity. These topics could have definitely been 
improved and elaborated in more detail. It would not have hurt 
(although it might have been necessary to have placed it after the 
chapters on data analysis) to identify some of the correlational 
ways reliability and validity can be estimated and to discuss some 
of the factors that affect them. 

Chapter 8 is very brief and discusses ideas related to the
interpretation of data. It includes such things as scales of measurement
(with an inappropriate example of ranks of professors for the
nominal scale), ways of tabulating data and, interestingly enough, a
filing scheme by McBee Systems for coding and retrieving information.

Chapters 9 and 10 present basic computational techniques, Chapter 9
dealing with descriptive data analysis and Chapter 10 following on to
inferential ideas. This is a reorganization, expansion, and splitting up
of material covered in one chapter in the earlier edition. The
current treatment is logically better but still has some bugs that could
have been ironed out. For example, the assumed mean procedure is
outlined with the statement made that the assumed mean value is
the true mean value. This is misleading since that would only occur
in a perfectly symmetrical distribution. There are also formulas for
computing the median, percentiles, and the pth percentile from
grouped data. All these seem rather cumbersome and unnecessary,
especially for a book of this type. In addition, the interquartile
and semi-interquartile ranges are discussed along with the formula
for finding the standard deviation using the assumed mean
procedure. Again, these will confuse the student by putting unnecessary
clogs in the wheel. The mechanics of how to compute the standard
deviation are given without much indication of “why” and “how”
it is used. As for correlation, the r and rho formulas are presented
without much mention of why one would logically use one as
opposed to the other. Regression lines are discussed in one section
while the prediction equation itself is given in another section.
Bringing these two together would have been better. In the chapter
on inferential statistics, Best discusses various sampling techniques
along with such things as the standard error of the mean, critical
ratio, t test, simple ANOVA, and chi square. In general, this material
is presented well and in a sensible format.
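
For readers unfamiliar with the assumed mean procedure mentioned above, the following Python sketch shows the usual ungrouped form of the computation. It is an illustration only, not Best's worked example: the function name `mean_sd_assumed`, the sample scores, and the use of N (rather than N - 1) as the divisor are assumptions. The point it makes is that the assumed mean is only a convenient starting value; a correction term recovers the true mean and standard deviation whatever starting value is chosen.

```python
from math import sqrt


def mean_sd_assumed(scores, assumed_mean):
    """Mean and standard deviation by the assumed mean (coded deviation) method.

    Hypothetical helper for illustration; uses divisor N for the variance.
    """
    n = len(scores)
    d = [x - assumed_mean for x in scores]   # deviations from the guess
    sum_d = sum(d)
    sum_d2 = sum(v * v for v in d)
    mean = assumed_mean + sum_d / n          # guess plus correction term
    variance = sum_d2 / n - (sum_d / n) ** 2
    return mean, sqrt(variance)


scores = [2, 4, 4, 4, 5, 5, 7, 9]
# Two different guesses lead to the same answers.
low_guess = mean_sd_assumed(scores, 4)
high_guess = mean_sd_assumed(scores, 6)
```

With these scores both calls give a mean of 5 and a standard deviation of 2, although neither guess equals the true mean.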

Chapter 11 deals with the research report and its write-up. Again, 
while the material is sensible and follows traditional patterns of 
stylistic considerations, most of the material is covered locally at
a particular university where they publish their own style guides. 
However, for those who are simply interested in basic mechanics 
and hints of the format of a research report (especially those who 
aren't in university settings), the chapter will probably be helpful. 

Chapter 12 presents several summaries of “significant” 
studies along with limited discussion of various points of these 
studies. While I'm dubious of the usefulness of this chapter, my
feeling is only a hunch. 

The last chapter deals with various aspects of federal support for 
educational research. Regional labs, the ERIC system, the support 
(although it is now dwindling!) for the training of educational
researchers, and ways and means for preparing money-seeking research
proposals are discussed. This is one of the best chapters in the book.

Finally, in terms of the content, the four appendices include (1)
a research report evaluation form, (2) a glossary of statistical
formulas and symbols, (3) areas under the normal curve, and (4)
percentile scores for any given two scores.

In summary, Best has produced an improved edition of Research
in Education. The basic improvements include the chapter on
experimental research, the reorganization of the data analysis
material, and the information on federal support for educational
research. The book is quite readable with a low difficulty level and
should be useful in assisting students in their pursuit of an overview
of aspects of the research process. More thought should have been
given to the descriptive analysis chapter but, other than this, Best’s
book is “better” than it previously was and should be well received.

Sheldon Blackman and Kenneth M. Goldstein. An Introduction to
Data Management in the Behavioral and Social Sciences. New
York: John Wiley, 1971. Pp. vi + 104. $7.95 and $5.95 (paperback).


This brief treatment of data setup and analysis procedures for 
use with computers is intended to be used by persons with no
previous exposure to computers and/or electronic data processing
equipment. A major theme of the text is that a plan for the conduct of an
experiment requires a parallel plan for the use of a computer in
data handling and statistical analyses. Jargon has been kept to a
minimum and a glossary of words commonly encountered in reading
about data management and program writeups has also been
provided.

After an introductory chapter, nine other chapters are presented
with content related to translation of data from a raw to a
machine-readable form and use of "canned" computer statistical program
packages.
Within the data translation section of the text, there are short 
discussions relating to principles for coding data and the presenta- 
tion of data in a matrix arrangement (Chapter 2); the develop- 
ment of a data-processing format and the punching of data into 
cards (Chapter 3); the concept of a variable format (Chapter 4); 
the editing and coding of data prior to card punching (Chapter 5); 
and of the purposes and use of auxiliary data processing equipment 
(Chapter 6). 

The discussion of canned statistical programs and/or systems
begins in Chapter 7 with a short description of the purpose and
examples of computer installation control cards (i.e., JCL cards, or
Job Control Language cards). Chapters 8 and 9 describe the program
cards commonly required in setting up
BMD and P-STAT program runs. This presentation contrasts the
use of the variable format (i.e., BMD) and free-format (i.e.,
P-STAT) methods of specifying data to be processed in a computer
run. Chapter 10 covers a wide range of topics related to acquisition
and modification of other available statistical programs and
packages. Two appendices are also included. Appendix A presents a
specimen set of data in a coded matrix arrangement while
Appendix B discusses matrix manipulations useful for data reduction
purposes. Both could have been omitted with no loss in communication
of content to the readers. Appendix A presents information
previously given in the body of the text while the second appendix
does not present enough information for an individual without a
background in measurement to be able to use the ideas presented
within this section. Questions are provided at the end of each
chapter; however, they are lacking in structure and are unlikely to help
a person who is in a self-study situation to understand the basic
concepts presented in the chapter. In contrast, a short list of
readings, keyed by topic, appears at the end of each chapter and is
likely to be more helpful.

The cost of the paperback version of the text indicates that a purchaser
pays dearly, especially when one considers that the 100 pages
contain little in the way of mathematical equations, generally the cause
for higher prices on scientific books. However, a casual perusal of
several popular textbooks on behavioral and social science research
methods indicates students will find little help from these sources
on how to set up data and program cards in order to make a
computer run. Moreover, it is this reviewer's experience that a student
learns to use a computer for data management and analysis purposes
if he has had a course in computer programming (e.g., FORTRAN)
or by having access to a person with experience in using a computer.
If the novice computer user does not find himself/herself in either
of these two situations, then this text would seem likely to offer a
way of getting his/her data through a computer run without being
overwhelmed by the process of having to use a computer. In
conclusion, while this text is not likely to be purchased by persons
with substantial experience in conducting research, it would seem
appropriate for use by persons who are setting up data to run
through a computer for the first time.


JOHN L. WASIK
Center for Occupational Education 
North Carolina State University 


George A. Ferguson. Statistical Analysis in Psychology and Edu- 
cation. (3rd ed.) New York: McGraw-Hill, 1971. Pp. xii + 
492. $10.95. 


planations and more examples, but rather an attempt to incorporate 
more topics within the book. Ferguson's statistics book is illustra- 
tive of this tendency, and also the inflationary tendency of sub- 
sequent editions to increase in price. The length of the book has 
increased from 347 pages in the first edition, through 446 pages
in the second edition, to 492 pages in the third edition. The
price of the book has gone from $7.00 in 1959 to $7.95 in 1966 to
$10.95 in 1971!

Considering the fact that 12 years, and apparently many adop- 
tions, have intervened between the first and third editions, one has 
a right to expect the author to have done some housekeeping by 
making corrections and responding to previously noted shortcomings 
of the book (Binder, 1960; Glass and Maguire, 1966). However, 
many of the criticisms made by the reviewers of the first and second 
editions of this book apply equally well to the third edition.
Although the present reviewer does not agree with Glass and Maguire
(1966) that Ferguson's “prose tends to be dry,” a detailed check of
the third edition to determine whether the specific problems noted
in the earlier reviews had been attended to revealed that many of
them had not. Surely Ferguson and his publisher were aware of
these reviews, but perhaps the former would explain, as he did in the
Preface (3rd ed.) in regard to the criticisms of the four consulting
editors, that “. . . limitations of time have prevented me from
incorporating some of their more insightful recommendations.”

Nevertheless, in spite of its datedness in spots (Fisher is not now 
living, and 1944 is not currently!) and its seeming imperviousness 
to criticism, the third edition remains like the first and second ones 
a fairly technically correct and at least average introductory
statistics book. The book is divided into four parts: I. Basic Statistics
(13 chapters), II. The Design of Experiments (7 chapters), III.
Nonparametric Statistics (2 chapters), and IV. Psychological 
Test(s) and Multivariate Statistics (5 chapters). 

The first 13 chapters have changed little since the first edition,
reflecting in their emphasis on scales of measurement (nominal,
ordinal, interval, ratio) the language of psychological statistics
during the late 1950s. The eight or so exercises at the end of each
chapter are still rather unimaginative in content, but they may
serve the purpose. In Chapter 4, $s^2$ is defined both as
$\sum (X - \bar{X})^2 / N$ and $\sum (X - \bar{X})^2 / (N - 1)$, but for
most of the book the second
definition is used. This ambiguity is partly the result of the author's
seeming preference for the greater simplicity of $\sum (X - \bar{X})^2 / N$
when it is employed in other formulas. In general, the first 13
chapters are straightforward, well-written, and perhaps sufficient for a
traditional, one-semester course. To be sure, there are typos and
other problems. For example, formula 10.5 (p. 140) needs an X in
the denominator; in a symmetrical distribution the mean, median,
and mode do not necessarily coincide (p. 52), and Ferguson's
explanation of sampling is still inadequate. Concerning the material
on correlation, McNemar (1969) and Glass and Stanley (1970)
handle the topic better, although the latter book is replete with
typos, misprints, and other minor errors. Ferguson has also scattered
the correlation material throughout the book (Chapters 7, 8, 21,
pail ta 26), whereas most statistics books put more of it all 


Comparing Part II of the third edition with that of the second 
edition, Chapter 14 has been moved to 21, Chapter 16 to 25, and 
Chapters 17-20 down to 14-16. Chapter 17 on three-way ANOVA 
is new, and Chapter 14 of the second edition (Rank Correlation 
Methods) has been moved to Chapter 21 in the third edition. Chap- 
ter 18 of the third edition (Multiple Comparisons) is a short, new 
addition which finally mentions Tukey’s method. However, the poor 
statement of assumptions underlying ANOVA and the omission of
the independence assumption, which also characterized the two
earlier editions, have not been corrected. And in Chapter 16 (p.
243), the repeated measures ANOVA is still not in customary form,
the reasons behind using sim? in the denominator of the F test being 
unclear. Also, is Ferguson serious about the necessity of testing for
differences among subjects by means of an F ratio? Finally, the
differences between random, fixed, and mixed effects ANOVA models
are not explained, a serious omission.

The title of Part IV (Psychological Test and Multivariate
Statistics) is somewhat misleading, since discriminant function
analysis, MANOVA, and other important multivariate techniques
are not discussed. Furthermore, Chapter 25 (Score Transformations:
Norms) belongs in Part I, and Chapter 26 (Partial and Multiple
Correlation) belongs closer to Chapters 7 and 8. Finally,
the reviewer seriously doubts whether Chapters 23, 24, and 25
statistics and factor analysis should have been included in an
introductory statistics book. The author would have spent his
time more wisely by attending to the criticisms of earlier
versions of the book and the suggestions of the consulting editors
rather than writing new material which may serve only
book. But as Binder (1960) and Glass and Maguire (1966)
concluded in their reviews of the first and second editions,
it is an adequate text, no worse than most and better than many.

Binder, A. M. Review of George A. Ferguson’s Statistical analysis
in psychology and education. EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, 1960, 20, 863-869.

Glass, G. V and Maguire, T. O. Review of George A. Ferguson's
Statistical analysis in psychology and education. (2nd ed.)
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1966.

Glass, G. V and Stanley, J. C. Statistical methods in education and
psychology. Englewood Cliffs, N. J.: Prentice-Hall, 1970.

McNemar, Q. Psychological statistics. (4th ed.) New York: Wiley,
1969.


Robert M. Gagné and William J. Gephart (Eds.). Learning
Research and School Subjects. Itasca, Ill.: F. E. Peacock, 1968.
Pp. ix + 268. $6.50.


Learning Research and School Subjects is the record of the Eighth
Annual Phi Delta Kappa Symposium on Educational Research
held October 28 and 29, 1967 at the University of California,
Berkeley.

The record of a symposium where leaders in a field meet and
present overviews of their work on a topic has become one of the
most useful summaries a student of the topic can obtain. The
discussants, who are themselves experts on the topic, bring out facets
of the presentation that often remain obscure in other forms of
research reporting.

Five chapters corresponding to five sessions make up the book.
These are (1) Concept Learning and Concept Teaching presented
by Robert Glaser and discussed by Patrick Suppes, Evan R. Keislar,
and James J. Gibson, (2) Perceptual Learning in Educational
Situations presented by Eleanor J. Gibson and discussed by J. M.
Stevens and Henry C. Ellis, (3) Two Scientific Approaches to the
Management of Instruction presented by Ernst Z. Rothkopf and


. Wittrock and discussed by Lawrence M. Sto- 
re C. Ellis. (5) The Quest for Prescriptive Values = 

Educational Programs presented by George C. Thompson and
discussed by Winfred F. Hill and Paul H. Mussen.

The focus of the different sections is on moving laboratory re- 
search to practice in the classroom. The problems of making this 
shift are most clearly delineated in the first two chapters, areas in
which more definitive research is available than in the later
chapters. One of the factors making translation difficult is the need in
the laboratory to define goals in measurable terms that can be 
or dun with эке eee When related phenomena 

classroom, they usually are much more complex, 
contaminated by indeterminate variables and measurable with little 
EM that the changes are related to the names given the vari- 


The problems of evaluation are brought out clearly in the first
chapter on Concept Learning and Concept Teaching. Eon pointed
out in his discussion that the laboratory studies deal almost
exclusively with concept identification in which the learner, animal
or human, has to identify whether the critical factor is roundness,
blueness, or darkness. These concepts are already within the reper-
racy and the depth of the formulation is very difficult. The order of
the scale on which measurement is made is different from that used
t identification. The possibility of error of measurement

classroom setting is much greater than in the laboratory, but
school learning it is complex concept formulation that is impor-

The situation is reminiscent of the story of the drunk down on his
knees beneath a lamp post, obviously looking for something. A
passerby, perhaps a psychologist, on questioning learned that a key
had been dropped. Further questioning revealed that the key had been
dropped half way up the block. To the question of why he was looking
here the drunk replied, “The light is better.” Hopefully the laboratory
psychologists can be enticed away from where the measurements can be clear

Eleanor Gibson’s presentation of perceptual learning was stimu-


higher order units is provocative. She sees perception as a process
of filtering the relevant from the noisy rather than a process of
adding or associating. In her higher orders she speaks of students


endo to processes of conceptualization and self-directing 


which result in learning. 


clude translation, segmentation and processing. Rothkopf points out 
that many of the things learned by students in school “may be 
quite undesirable from the point of view of the instructor.” He
describes research designed to bring these activities under more di- 
rect and predictable control of the teacher or researcher. 

Wittrock describes three approaches to research on transfer of
training. These are “(1) simplistic S-R models (2) mediated
generalization models (S-r-s-R) and (3) Gagné's model of
learning sets.” These are quite straightforward and
another. One of the participants raised the question as to whether
they could not all be subsumed under a single category with subsets.
Wittrock agreed that this is possible but expressed his belief that
separating the approaches makes further research more likely.

Thompson closed the conference with a presentation on values, 
their nature, validity, and how they can be taught. Obviously, this 
is a most complex and difficult topic. The conference reached the 
E conclusion.

If a criticism should be made of the book and of the conference,
it would have to be that the topic was too broad to handle
adequately. Even so, a clear understanding of the point of view of
many of the participants does emerge. Students will find Learning
Research and School Subjects a useful orientation to the views of a
number of leaders in a number of fields.

JOHN A. R. WILSON
University of California 
Santa Barbara, California 


Stephen Isaac and William B. Michael. Handbook in Research and 
Evaluation. San Diego: Robert R. Knapp, 1971. Pp. vi + 186. 
$7.95 and $4.95 (paperback). 


Handbook in Research and Evaluation falls somewhere between 
recipe oriented books like Bruning and Kintz (1968) and Winer 
(1971) on the one hand, and more discussion-oriented books such as 
Kerlinger (1964) and Dayton (1970) on the other hand. Isaac 
and Michael have put together a brief volume that achieves a happy 
balance between the two ends of the continuum. As they state in 
the Foreword, the book was prepared for a researcher or research 
evaluator who simply wants an overview, a summary of alternative 
approaches, an exhibit of reference models, or a listing of strengths 
and weaknesses of different methods of research. In doing so,
the authors rightfully add the caution that this “balance” approach
has risks involved, namely possible oversimplifications of the
material. However, according to this reviewer, the material has
been handled in such a way as to make the risk factor minimal.

The book is organized into five chapters. Chapter 1 deals with 
planning research and evaluation studies. Topics range from com- 
mon mistakes in the formulation of a research problem, through 
advantages of a pilot study, to planning for computer analysis 
and data processing. Chapter 2—the lengthiest of the five—deals 
with research designs and strategies of research. The authors, for 
convenience sake only, categorize nine different types of research 
(historical, descriptive, true experimental, etc.). In addition, simple
research designs are explicated along with brief discussions of im- 
portant topics such as confounding, interaction, internal and external 
validity, statistical regression, and some disadvantages of matching 
as a control device. Chapter 3 presents information on
instrumentation and measurement. A test evaluation form is given, the oft
reprinted normal curve table from The Psychological Corporation is
presented, techniques for item analysis are discussed, reliability 
and validity are outlined, along with information on such instru- 
mentation as mailed questionnaires, research interviews, the seman- 
tic differential and creativity tests. Chapter 4 provides a summary 
of the most widely used statistical tools—both basic descriptive 
and elementary inferential. Computing guides are given for such 
measures as percentiles, means and standard deviations, correlation, 
chi square, and the t test. This chapter also summarizes
information concerning hypothesis testing (type I and II errors), the power
of a statistical test, and sampling. Chapter 5, the final one, presents 
criteria and guidelines for writing research proposals (do any of 
these work?) and reports. Included is a checklist for evaluating a 
research article, examples of vague and clearer behaviorally written 
objectives, the inevitable Taxonomy of Educational Objectives 
(Bloom et al., 1956), a model for evaluating school programs, and 
finally, an excerpt from Skinner (1959) concerning a dissenting
view of research methodology and theory. An interesting way to
end the book. One wonders if it were a Freudian slip.

A few minor negative points on the material should be pointed
out. In Chapter 3 on measurement, it is mentioned that the stanine
scale consists of nine intervals—each being one-half of a standard 
deviation wide. This is true for stanines 2 through 8, but not 
stanines 1 and 9; these go to infinity. A small point, but it could 
have been clarified. In the section on item analysis, chi square 
is given as the technique, with the chi square value being computed 
from the 2 x 2 table of Right-Wrong versus High scorers-Low 
scorers. In this discussion, it is pointed out that a significant chi 
square value indicates that a dependable difference between the 
number of correct answers exists between the high and low scorers, 
and therefore the item should be retained. However, one can obtain
a significant chi square value when more of the low scoring
students answer the item correctly than do the high scoring students.
Therefore, it should be mentioned that if more of the high group 
get the item correct and the chi square value is significant, then one 
should consider the item to have discriminating power. To the 
person reading the book, the lack of making this point clear could 
cause some confusion. 
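
The point about direction can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the Handbook's computing guide: the function name `item_discriminates` is hypothetical, no continuity correction is applied, and the critical value 3.841 is the tabled chi square for 1 degree of freedom at the .05 level. The item is flagged as discriminating only when the chi square is significant and the high scorers do better than the low scorers.

```python
def item_discriminates(high_right, high_wrong, low_right, low_wrong,
                       critical=3.841):
    """Return (chi_square, keep) for a 2 x 2 Right-Wrong by High-Low table.

    Hypothetical helper; chi square is computed from expected frequencies
    without a continuity correction.
    """
    table = [[high_right, high_wrong], [low_right, low_wrong]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("a row or column total is zero")

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected

    p_high = high_right / row_totals[0]   # proportion correct, high scorers
    p_low = low_right / row_totals[1]     # proportion correct, low scorers
    keep = chi_square > critical and p_high > p_low
    return chi_square, keep


# Same chi square in both cases, but only the first item should be kept.
good_item = item_discriminates(40, 10, 20, 30)
reversed_item = item_discriminates(20, 30, 40, 10)
```

Both calls return the same chi square value, which is exactly why the direction of the difference has to be checked separately.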
In Chapter 4 on statistics, two points should be mentioned. First,
in the computing guide for t, a footnote says “concerning negative
numbers, -1.70 is less than -1.65” (p. 134). However, this is not
true as far as the t value is concerned. A t value of -1.70 is larger
than -1.65. The minus sign only indicates the direction of the
differences in the sample means. In the first computing guide for
chi square, the basic formula given for the 2 X 2 table is the one
where each cell is labeled A, B, C or D and then the manipulations
are made with sums and multiplications of these values. However,
for tables greater than 2 X 2, the more traditional expected minus
observed frequency formula is then presented. It is my feeling that
the more familiar formula should have been presented for all chi
square computations. However, the alternate formula could be given
as useful when 2 X 2 tables are being used. I do think though, that
the more common formula should be the basic one that is started
with, and that giving it will allow the reader to better grasp the
basic idea of doing chi square.

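
As a hedged illustration of the two chi square formulas discussed here, the sketch below computes the statistic both ways for a 2 x 2 table. The function names and the example counts are hypothetical, and no continuity correction is applied.

```python
def chi_square_cells(a, b, c, d):
    """Shortcut formula for a 2 x 2 table with cells labeled A, B, C, D."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_expected(table):
    """General formula: sum of (observed - expected)^2 / expected.
    Works for any rectangular table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


shortcut = chi_square_cells(40, 10, 20, 30)
general = chi_square_expected([[40, 10], [20, 30]])
```

The two results agree for a 2 x 2 table, which is why the general formula can serve as the basic one and the cell formula as a computing convenience.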
With the exception of the few minor points mentioned above, my
general reaction to the Handbook is very favorable; in fact it seems
to me to be the best book covering educational research material
to come out for a long time. The major strength lies in the chapter
on research design. Isaac and Michael have produced an excellent
practical guide for the audience that they intended it for. As
someone said, most textbooks could be condensed by at least 50 per
cent without any substantial loss in meaning. The current authors 
have done precisely that with what I consider to be a gain in useful- 
ness. I would strongly suggest that people interested in the research 
endeavor to investigate the contents of this well done book. 
Robert B. McCall. Fundamental Statistics for Psychology. New
York: Harcourt, Brace Jovanovich, 1970. Pp. viii + 419. $9.50. 


Considering the fact that dozens of elementary statistics text- 
books are currently available, it may be viewed as presumptuous 
to offer yet another one. But Robert McCall’s Fundamental Statistics
for Psychology is a couple of standard deviations above the mean, 
and it is certainly worth examining by teachers of the introductory 
course in statistics for the behavioral sciences. 

The book has been designed with an eye to teachability and the 
psychology of learning, an orientation which more textbook writers 
might well adopt. Although a knowledge of high school algebra 
and geometry is sufficient mathematical background for 95 per cent 
of the book, the author recognizes the importance of logic, formulas, 
proofs, and statistical concepts. A review of symbols, fractions, 
factorials, exponents, factoring, roots, and interpolation is included 
in an appendix. 

The use of tabular inserts for more technical material and detailed 
proofs adds to the continuity and ease of reviewing the text.
Repetition of symbols and their names, explanations of tables both in
the text and at the tables themselves, in addition to isolation and 
review of concepts and formulas, are other important procedures 
used by the author to insure understanding and retention. 

The 12 chapters of the book include the usual topics in descriptive
and inferential statistics at the elementary level: frequency dis-
tributions, central tendency, variability, percentiles, regression,
correlation, hypothesis testing, and t tests. In addition, one-way
and two-way analyses of variance are covered in two chapters;
nonparametric techniques in a long, 41-page chapter; and further
topics in probability in the final chapter. Almost twice as many
pages are devoted to the last six chapters, which deal with 
statistical inference, as to the first six chapters on descriptive sta- 
tistics. Chapters 7 and 8 on hypothesis testing are particularly well 
written. 

Exercises are placed at the end of a section—another useful 
pedagogical device—rather than waiting until the end of the chapter. 
However, the student must turn to an appendix to confirm his 
answers. Also at the back of the book are the customary statistical 
tables, a glossary of symbols, and an index. The book is attractively 
packaged in a blue and white cover. In sum, this newcomer should 
be a respected competitor on the elementary statistics book market, 
and one that this reviewer is happy to recommend. 

LEWIS R. AIKEN, JR.
Guilford College 


Daniel N. Robinson (Ed.) Heredity and Achievement. New York: 
Oxford University Press, 1970. Pp. x + 441. $4.95 (paperback).


This is a collection of readings intended for an introductory course
in behavioral genetics, with emphasis on the especially important is- 
sues of intelligence, and of racial differences in intelligence. Since it 
provides an informative introduction covering concepts in genetics, 
a course built around this book would not require any previous
exposure on the part of students to genetics. The two readings by 
geneticists Hirsch and Dobzhansky, placed late in the book, are also 
rich in didactic material, and might well be read first along with 
the introduction. It would be wise for the psychologist instructor,
however, to know a bit more population genetics than the book pro-
vides, and for the geneticist instructor to know much more about the
controversy, statistics, and psychological measure-
ment.

The initial selections are studies which illustrate the genetics 
of maze-learning ability, spontaneous activity, avoidance condi- 
tioning, and memory in rats or mice. For establishing the basic 
point of there being a genetic basis for behavior, these papers are
invaluable, if sometimes tiresome with minute experimental detail.
These are followed by Gordon Allport, discussing traits, and David 
Rosenthal on familial concordance by sex for schizophrenia—a 
long paper so packed with close argument that it will be difficult 
for most students to follow. This section closes with a paper addressed 
to its theme, the inheritance of personality, by Gottesman. This is 
a clean, straightforward piece, although its indexes of heritability 
may now be somewhat dated. Jensen (1967) has presented a re- 
vised formula for heritability and has pointed out that unless 
corrections for attenuation are used, estimates of heritability are 
too low. According to Jensen, there also appear to be some peculiari- 
ties associated with heritability estimates of personality variables. 
In a later section, Beach’s call for cross-species comparative 
research, like an earlier one by Verplanck, focuses attention on 
profound differences in animal behavior that must be rooted in 
genetics, and Scott’s discussion of critical periods, such as in im- 
printing, presents many examples of acute genotype-environment 
interaction of a highly special sort. (Oddly enough, the editor 
fails to point this out, although he makes frequent mention of such 
interactions in other contexts.) Unfortunately, the critical periods 
model needs to be scrutinized carefully before its limited applicabil- 
ity to common differences in intellectual performance is apparent, 
and neither the article itself—which ends on a seductive note about
олон of "learning ‘not to learn’ "—nor the editor provides 


A major part of the book deals with race and intelligence, both
directly and indirectly. The editor distinguishes two camps (p. 3);
going back almost thirty years to Boring’s terms, the “nativistic”
and the “empiricistic.” “Environmentalistic” might have been a
more neutral term to present to students, who will be unaware of the
context of Boring’s use of these terms in 1942, and who probably
have been socialized to regard “empiricists” as the “good guys.”
The introduction ridicules lay questions such as, “What fraction
of intelligence is determined genetically?” with the help of portentous
but cryptic references to gene-environment interaction and gene
action (p. 4), instead of training the student to think in terms of 
heritability of IQ by particular populations in stated environments. 
Genotype-environment interaction really boils down to a statistical 
question in calculating heritability, and since available evidence 
(Jinks and Fulker, 1970; Jensen, 1970a) indicates that this com-
ponent of the variance in IQ is negligible, it is not the bugaboo that
the frequent allusions to it would have us believe. Most likely, 
these allusions are predicated on interactions that are dramatically 
apparent but which occur only far outside of the usual range of 
interest of the environmental variables concerned, for example, when 
they are lethal to the organism. It is also discouraging to students
to be told, as the editor does, that the nature-nurture problem is a
“. . . ion,” without leading them to think in terms of
heritability and components of phenotypic variance. If it were a
. . . ion, this book would hardly be necessary.

“It would be highly unlikely that very significant differences would
exist among races in regard to those characteristics which are vital
to survival—for example, ‘intelligence,’” the editor states (p. 13),
and he quotes geneticists Fuller and Thompson in support, who said,
“.. . it is likely that natural selection tends to oppose the establish- 
ment of major heritable behavior differences between races.” But 
who decides what size difference is “very significant” or “major?” 
One standard deviation in IQ may be trivial on the scale of nature, 
although of considerable consequence on the scale of human affairs. 
In these same passages (p. 13), the editor advances some extremely
dubious assertions in an attempt to account for phenotypic racial 
differences—almost as though they were nongenetic—brought about 
by selection pressure, and ends by suggesting that “the transloca- 
tion of these racial genotypes to cultures calling for very different 
forms of intellectual expression could place the racial minority at 
something of a disadvantage. However . . . the relocated race would
contain genotypes whose norm of reaction surely allowed adaptation
to new requirements, even if it preferred some slightly different
form of expression.” This appears to be simply yet another a priori
attempt to define any genetic differences that might be established
as unimportant, instead of talking about their actual possible
magnitude. The use made of “norm of reaction” in this connection
strikes me as wishful, as does the vague reference to the adaptable
genotypes, without consideration of their relative frequency.

It seems to be the prevailing impression that any geneticist is 
automatically better qualified—as though some kind of Guardian 
of DNA—than any social scientist to discuss these issues, although 
some cultural anthropologists claim that they are the ultimate 
authorities (Diamond, 1962). Accordingly, the editor attempts to 
trump Jensen by playing the geneticist Hirsch, who has “provided 
a one-paragraph qualification of the facts and views” of Jensen
(p. 222). In this paragraph, Hirsch instructs Jensen, with the 
help of numerous exclamation points and sarcastic asides, not only 
about genetics, but also about defining race, heritability, and in- 
telligence, and about the education of the disadvantaged as well. 

Those personally acquainted with Jensen’s careful and thoughtful
consideration of all of these issues will recognize the injustice 
being done here, not just to the scientist, but to science itself. 

Robinson follows this with a serious misstatement of fact (p.
). He says that 25 per cent of the black population exceeds the
mean IQ of the white population, whereas the correct value has
been given by Shuey (1966, pp. 501-502) as 11 per cent. He trifles
with the problem posed by some northern blacks scoring higher
in IQ than some southern whites by ignoring the possibility of 
selective migration and archly asking whether "genotype changes 
with latitude?" Treating a supposed association between school
expenditures and child's IQ in the same manner, and ignoring the 
association between SES and IQ as a potential source of spurious- 
ness, as well as the failure of the Coleman Report to find important 
relations between school variables and pupil achievement, he asks, 
“Does genotype vary with educational expenditures?" The answer 
to both questions, of course, is quite possibly, yes. In my opinion, 
the purpose of an introductory text should be to discuss such issues, 
not to pose polemical questions left unanswered. A bit further on 
(p. 223), Robinson reports that monozygotic twins, “reared in very 
different environments, will reveal average IQ differences of fourteen 
points.” However, he omits to state that the very same test showed 
an average difference of nine points for monozygotic twins reared 
together (Gottesman, 1968), and so only about five points of the 
fourteen could be attributed to the difference in environments be- 
tween families. Since a comprehensive review of all studies of IQ 
differences between identical twins reared apart shows that the 
grand average difference is only 6.6 points (Jensen, 1970a), the 
large difference of nine points reported by Gottesman even for iden- 
tical twins reared together suggests, as we might expect, that the IQ 
test in question (Raven’s Mill Hill Vocabulary Scale) was less 
reliable than the Stanford-Binet or Wechsler-Bellevue, which have 
been used in other such studies. When Jensen (1970a) pooled the
IQ’s from the Mill Hill with those from another short IQ test given
at the same time, thereby enhancing the reliability of the final IQ,
the average difference for these twins reared apart became 6.72,
which is quite comparable to values observed in the other major
studies of such twins, using longer tests (Jensen, 1970a). In
evaluating the average difference in IQ between monozygotic twins 
reared apart, furthermore, it is always necessary to take into 
account the component due to measurement error, as reflected in 
average differences between two testings of the same individual 
with alternate forms; these differences average 4.68 for the Stanford- 
Binet (Jensen, 1970a). One should also give attention to evidence, 
reviewed by Jensen (1969a, 1970a), that IQ differences between 
monozygotic twins seem to be associated with prenatal and other 
biological influences, rather than with the social environment. When 
all of these considerations are taken into account, Robinson’s 
use of the fourteen point difference is seen to be exceedingly mis- 
leading. Yet, this is exactly the kind of “fact” that will stick in 
students’ minds. 

Unfortunately, Jensen is not represented in this book. There are,
however, excellent readings by Burt, and by Erlenmeyer-Kimling
and Jarvik, on heritability of IQ, which in combination with
Hirsch's statement in his article that separate breeding
populations are “almost certain to differ” in relative frequencies
of different alleles in their gene pools, could set thoughtful students
thinking despite the editor's distractions. The 1960 review of
psychological studies of race differences, by Dreger and Miller, is
also included. Like their later work, it bends over backwards not
to draw any conclusions, and suffers consequently from a nomologi-
cal shallowness. A selection by Wesman on the definition of intelli-
gence defines it as “the summation of the learning experiences of the
individual,” thereby receiving the editor’s endorsement. Nothing is
said about intelligence as the capacity to learn or, in Jensen's
work (1969a), as abstract reasoning ability. Wesman's definition
seems to suggest that we can teach individuals all to be very intel-
ligent, although IQ test performance has proven remarkably resistant
to coaching, and the school performance of low IQ children has
been equally hard to boost on a permanent basis.

Many readers will be irritated by the number of times that 
intelligence is placed in quotation marks, or referred to as “it,” in
various places. This adds nothing but mystification. Most will also 
find Hirsch’s attack on the mean, and concern with other parameters 
such as skewness and variance, to be equally excessive, even for 
the purpose of discrediting “typological” thinking. The mean, 
after all, is the statistic that best summarizes all of the observations 
in the distributions in question, and one-way ANOVA is known to 
be quite robust for slight differences in variance.

A teacher of behavioral genetics will be able to use this book if
he supplements it with other readings so as to balance the picture
and remain current. I say this with ambivalence, because it would
mean giving wider circulation to Hirsch’s article, in which he put
words in the mouth of the psychologist Garrett that are sufficiently
removed from what Garrett actually said to constitute an act that is
at least mildly vicious. Suggested supplementary reading would
include “must” papers by Jensen (1967, 1969a, 1969b, 1969c,
1970a, 1970, 1971), and papers by De Lemos (1969) and Garron
(1970), the latter two dealing for a change with nonverbal and 
quantitative abilities. Students should also be exposed to the work 
of Lesser, Fifer and Clark (1965), which shows cognitive profiles 
unique to different ethnic groups, but constant across SES. Palmer 
(1970) and Lane, and Albee and Doll (1970) have shown some 
environmental differences that do not make a difference in IQ,
including having a schizophrenic parent. Some policy considera-
tions are treated well in Jensen (1970c, 1970d), and Bereiter (1970),
and moral issues are sensibly discussed in Bressler (1968), Brues
(1964), Ingle (1970), and Scriven (1970). Important topics re-
lated to the validity of ability tests for disadvantaged groups are
covered in Stanley (1971) and Sattler (1970). A collection of
readings from a different perspective appears in Kuttner (1967),
and if one wishes a really sweeping overview by an outstanding 
geneticist, there is Darlington's (1969) book. Finally, for those 
who would like to give students a whiff of diatribe, there is Alfert 
(1969a, 1969b), followed by Jensen’s replies (1969d, 1969e).


Joseph A. Steger (Ed.). Readings in Statistics for the Behavioral
Sciences. New York: Holt, Rinehart and Winston, 1971. Pp.
ix + 406. $5.95 (paperback).


This book of readings contains thirty-three articles divided into 
five chapters. The editor’s stated purpose for the book is “. . . to 
supplement the basic courses in statistical methods and research 
design, or other undergraduate or first level graduate courses”
(p. v). The designated audience is “. . . those who are not statisti-
cians but who use statistics as tools in their field of study” (p. v).

Chapter one, entitled “Measurement and Statistics,” is con- 
cerned with scales of measurement. A presentation by Stevens 
(1951) of his four scales of measurement is placed first, followed 
by three additional articles which argue for less demanding
prescriptions.

Chapter two deals entirely with Chi Square. It begins with the 
lengthy article by Lewis and Burke (1949) on “The Use and 
Misuse of the Chi Square Test,” followed by five “comments,” 
“replies,” and “notes” defending or disputing the notions of Lewis 
and Burke. A fairly technical article by Cochran (1954) ends this
chapter.

Chapter three is brief, dealing with “Parametric Techniques.” It 
includes an historical article by "Student" (no reference) on the 
Lanarkshire milk experiment, an article dealing with transforma- 
tions, and one dealing with analysis of covariance. 

Chapter four is the book’s lengthiest. Entitled “Assumptions and 
Statistical Inference,” this chapter divides into three sub-sections. 
The first contains seven articles and notes on one-tailed vs. two- 
tailed hypothesis testing. The second focuses on null hypothesis 
testing, with six articles. The final sub-section contains two articles 
dealing with the effects of violations of assumptions in parametric 
tests, specifically, t and ANOVA. 

Finally, Chapter five is entitled “Potpourri,” and, as the title 
suggests, contains four unrelated articles. The most noteworthy 
occupant of this chapter is Walker's classical presentation of “De- 
grees of Freedom” (1940). 

Because the editor makes a strong plea for use of the book as a
supplementary text, it seems most reasonable to evaluate it primarily
on pedagogical grounds. A brief section preceding the readings,
“On Learning Statistics,” is included apparently to enhance the
book's value to a learner. This section attempts to convince students
that in spite of their preconceptions, they can learn statistics. It 
also contains a list of commonly used symbols. This attempt, albeit 
brief, at making the book student-oriented is commendable. How- 
ever, other pedagogical devices, such as end of chapter summaries, 
objectives, study questions, etc., were noticeably absent. On the 
positive side, a reasonably thorough index is included.

The editor wrote an introduction to each chapter, but they tended
to be all too typical of books of readings, consisting essentially of 
summaries of the ensuing articles with little integration. In one 
noteworthy place, an “everyday” example was well used to introduce 
бйле» б eire кыы within this same section вно 
fined. was 

artic е and hence, probably not. колы also covered by the 
Selecting articles for a book of readings in statistics with the
purpose stated for this book is admittedly difficult. Articles from 
psychological journals (the major source for this book) are typically 
conceptually and/or mathematically difficult. Furthermore, often 
the material is quite specific and technical. Hence, finding articles 


which are both of general interest and understandable to a beginning
student would seem to be difficult, if not impossible.

Granting this difficulty, the reviewer feels the editor was not
successful in achieving his goals. There seem to be few articles


professional statistician, e.g., “Analysis of Covariance: Its Nature
and Uses” by Cochran (1957). The editor did state that “. . . the
readings have been edited to abridge the highly technical material
where possible” (p. v). However, as well as this reviewer could
assess, the “where possible” translated to “seldom” and resultant
editing did not alleviate the problem to any extent. 

The editor included three controversial issues which proved
not an unmixed blessing. Showing students that statisticians do
not always agree, in fact, that they sometimes rival sophists in the
energy they can devote to small metaphysical disputes, is
very worthwhile. Ritualized statistical practices are certainly
too prevalent. However, these controversies take up much of the
book, leaving it ultimately unbalanced.

Seventy-seven pages are devoted to the use and misuse of the
Chi Square statistic. The timeliness of this issue is reflected in
the fact that the latest article included was published in 1950!
Twenty-nine pages focus on one-tailed vs. two-tailed hypothesis 
testing (the latest article was published in 1954!) This reviewer 
personally found both of these sub-sections to be quite tedious.
Finally, eighty-three pages confront null hypothesis testing, begin- 
ning with Rozeboom's “The Fallacy of the Null-Hypothesis Test," 




Statistics is a vast subject. Nonetheless, to exclude all of regression
and strength statistics (e.g., r, eta, omega) without as much as an
introductory mention of their existence seems indefensible. One gets
the impression by implication, both from the title and from the pref-
ace and introductions, that this book covers most of the statistics used
by behavioral scientists. In fact, hypothesis testing is emphasized at
the expense of almost all else.

There is another count on which the title is misleading. The book
does not seem to be written for the behavioral scientist, but rather
for the psychologist. This can be witnessed by the heavy use of 
material from psychological journals and books (26 out of the 31 
cited articles). Moreover, it is highlighted by Dukes’ article 
(“N = 1,” 1965) which is completely concerned with psychological 
research. 

In sum, the editor set out to put together a book of readings for 
“non-statistician” statistics students, from basic to graduate. The 
book does contain some excellent articles, ones which students would
do well to read. The inclusion of several controversies also has merit. 
Nonetheless, for reasons given above, this reviewer feels that the 
book has limited pedagogical value. The basic idea of a supple- 
mentary text for a given subject which contains writings of people 
“in the field” seems sound. Whether this can be successfully ac- 
complished in statistics is not answered by this book. 
1 Neither the reference of an article by Bartlett, “The Use of Transforma-
tions,” nor the reference of the article by Student (mentioned earlier) was
cited. These are very unfortunate printing errors.


COMPUTER PROGRAMS


This section is provided for the early publication at the expense 
of the author of computer programs relevant to measurement in 
the fields of education and psychology. Customarily, a program
should be expected not to exceed six or eight printed pages. Manu-
scripts of four or fewer printed pages are preferred. Each manu-
script will be carefully reviewed as to its suitability and accuracy
of content. In some instances an accepted paper may be returned
to the author for possible revisions or shortening. The cost to the
author will be forty-five dollars per page plus ten dollars extra
per page for tables, figures, and formulas. 


Authors are granted permission to have reprints made of their 
articles at their own expense. 


Manuscripts received up to September first will be considered 
for the Spring issue; manuscripts received between then and March 
first will be considered for the Autumn issue. 


All correspondence and duplicate manuscripts should be directed 
to: 


Dr. William B. Michael 
325 Callita Place 
San Marino, California 91108. 
A THEORETICAL STUDY OF THE MEASUREMENT
EFFECTIVENESS OF FLEXILEVEL TESTS¹


FREDERIC M. LORD 
Educational Testing Service 


A conventional test becomes a flexilevel test when modified so
that the examinee follows these rules: 

1. Answer first a specified test item of median difficulty. 

2. After answering an item correctly, attempt next the easiest un- 
answered item of more-than-median difficulty. After answering 
an item incorrectly, attempt next the hardest unanswered item 
of less-than-median difficulty. 

A special answer sheet is used so that the examinee will know 
whether each answer is correct or incorrect. If the conventional test 
contains N items, the examinee taking the flexilevel test will attempt 
only n = (N + 1)/2 of these. A method for implementing flexilevel 
testing is described by Lord (1971b). 

Surprisingly, it appears that number-right scoring is quite effective 
for flexilevel tests (Lord, 1971b), in spite of the fact that different 
examinees answer different sets of items. A worthwhile refinement,
used throughout the research reported here, is to add one-half score 
point to the number-right score of each examinee who answered his 
last-attempted item incorrectly. 
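
The scoring refinement and the test length can be sketched as follows. The function names are hypothetical, and responses are assumed to be recorded as booleans in the order the items were attempted.

```python
def items_attempted(total_items):
    """Number of items n an examinee attempts when the conventional
    test has N items; N is assumed odd so that n = (N + 1) / 2."""
    if total_items % 2 == 0:
        raise ValueError("N is assumed to be odd")
    return (total_items + 1) // 2


def flexilevel_score(responses):
    """Number-right score, plus one-half point when the
    last-attempted item was answered incorrectly."""
    number_right = sum(1 for correct in responses if correct)
    bonus = 0.0 if responses[-1] else 0.5
    return number_right + bonus
```

For example, a hypothetical examinee with responses right, right, wrong would receive 2.5 rather than 2.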

A crucial question is whether flexilevel testing will be too confus- 
ing or too time-consuming for many examinees. Empirical studies 


1 This research was sponsored in part by the Personnel and Training Re- 
search Programs, Psychological Sciences Division, Office of Naval Research, 
under Contract No. N00014-69-C-0017, Contract Authority Identification 
Number, NR No. 150-303, and Educational Testing Service. Reproduction in 
whole or in part is permitted for any purpose of the United States Govern- 
ment. 
are needed to answer this and other questions of practical effective-
ness.

Since a theoretical study can be done more quickly and less ex- 
pensively than a substantial empirical study, the study reported 
here was carried out in order to evaluate various flexilevel tests from 
a theoretical point of view. An important purpose was to try to sepa- 
rate some flexilevel designs that are worth trying out empirically 
from those that are altogether inferior to other tests. 

In order to carry out a theoretical investigation of this type, it is 
necessary to be able to predict probabilistically how a given exam- 
inee will respond to items different from those already administered. 
Consequently, the present results are derived from item characteris- 
tic curve theory (see, for example, Lord, 1970, sections 3-4). 

Here we assume the probability $P_i$ that a given examinee will
answer item $i$ correctly depends only on his “ability” level, denoted
by $\theta$, and on certain item parameters: $a_i$ (“discriminating power”),
$b_i$ (“difficulty”), and $c_i$ (“pseudo chance-score level”). These item
parameters are assumed to have been already determined, to an
adequate approximation, by pretesting.


Conditional Frequency Distribution of Test Score 

We can evaluate any given flexilevel test once we can determine
$f(x \mid \theta)$, the conditional frequency distribution of test scores $x$ for
examinees at ability level $\theta$. Given some mathematical form for
the function $P_i = P_i(\theta) = P(\theta; a_i, b_i, c_i)$, the value of $f(x \mid \theta)$ can
be determined numerically for any specified value of $\theta$ by the
recursive method outlined below.

Assume the $N$ test items to be arranged in order of difficulty, as
measured by the parameter $b_i$. We will choose $N$ to be an odd num-
ber. For present purposes (not for actual test administration)
identify the items by the index $i$, taking on the values $-n + 1$,
$-n + 2, \ldots, -1, 0, 1, \ldots, n - 2, n - 1$, respectively, when the
items are arranged in order of difficulty. Thus $b_0$ is the median item
difficulty.


Consider, for example, the sequence of right (R) and wrong (W)
answers

R W W R W R R R W R.

Following the rules given for a flexilevel test, we see that the corre-
sponding sequence of items answered is

$$i = 0, +1, -1, -2, +2, -3, +3, +4, +5, -4, +6.$$

The general rule is that if item $i$ is the $v$th item administered and
item $j$ is the $(v + 1)$th, then, for flexilevel tests,

$$\text{either } j = i + 1 \text{ or } j = i - v \quad \text{when } i \ge 0,$$
$$\text{either } j = i - 1 \text{ or } j = i + v \quad \text{when } i < 0.$$
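
The general rule can be sketched in a few lines. The function name and the encoding of answers as a string of 'R' and 'W' are assumptions made for illustration; the answer pattern used in the example is the one implied by the item sequence 0, +1, -1, -2, +2, -3, +3, +4, +5, -4, +6.

```python
def flexilevel_sequence(answers):
    """Return the indices of the items administered, given a string of
    'R' (right) and 'W' (wrong) answers. The returned list has one more
    entry than there are answers: the last entry is the item that would
    be administered next."""
    items = [0]                      # the first item is the median item
    for v, answer in enumerate(answers, start=1):
        i = items[-1]                # item i is the v-th item administered
        if answer == "R":
            j = i + 1 if i >= 0 else i + v
        elif answer == "W":
            j = i - v if i >= 0 else i - 1
        else:
            raise ValueError("answers must contain only 'R' and 'W'")
        items.append(j)
    return items


sequence = flexilevel_sequence("RWWRWRRRWR")
```

Under this sketch the examinee never returns to an item already answered, since a correct answer always moves to the easiest unused item above the median and an incorrect answer to the hardest unused item below it.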


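In the same context, let $P_{ij} = P_{ij}(\theta)$ denote the probability
that item $j$ will be the next item administered after item $i$.
Writing $Q_i(\theta) = 1 - P_i(\theta)$,

$$\text{if } i \ge 0, \quad P_{ij} = \begin{cases} P_i(\theta) & \text{if } j = i + 1, \\ Q_i(\theta) & \text{if } j = i - v, \\ 0 & \text{otherwise;} \end{cases}$$

$$\text{if } i < 0, \quad P_{ij} = \begin{cases} P_i(\theta) & \text{if } j = i + v, \\ Q_i(\theta) & \text{if } j = i - 1, \\ 0 & \text{otherwise.} \end{cases}$$

For examinees at ability level $\theta$, let $p_v(i \mid \theta)$ denote the prob-
ability that item $i$ is the $v$th item administered. Clearly,

$$p_{v+1}(j \mid \theta) = \sum_i p_v(i \mid \theta)\, P_{ij}(\theta). \tag{1}$$

Now, the first item administered ($v = 1$) is always item $i = 0$, so

$$p_1(i \mid \theta) = \begin{cases} 1 & \text{if } i = 0, \\ 0 & \text{otherwise.} \end{cases}$$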


Starting with this fact and with a knowledge of all the $P_i(\theta)$
(determined from pretest data), equation (1) allows us to compute
the values of $p_v(i \mid \theta)$ for each $i$, for $v = 2, 3, \ldots, n$, and for any
specified set of values of $\theta$.

Now we can make use of a readily verified feature of flexilevel
tests. Again let $j$ represent the $(v + 1)$th item to be administered.
If $j > 0$, then the number-right score $r$ on the $v$ items already
administered is $r = j$; if $j < 0$, then $r = v + j$.

Thus the frequency distribution of the number-right score $r$ for
examinees at ability level $\theta$ is given by $p_{n+1}(r \mid \theta)$ for those exam-
inees who answered correctly the $n$th (last) item administered, by
$p_{n+1}(r - n \mid \theta)$ for those who answered incorrectly. This frequency
distribution can be computed recursively from (1).
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As already noted, the actual score assigned on a flexilevel test is 
x = т if the last item is answered correctly, = = r + % it is 
answered incorrectly. Consequently the conditional distribution of 
test scores is 


fe | 4 = |: Parila | 0) if x isan integer, (2) 
n+ı(7 —n — ‡ | 0) if = isahalf integer. 
For any specified test design, this conditional frequency distribution 
f(x | 6) can be computed for т = 0, 15, 1, 115, --- , n for various 
values of 6. Such distributions constitute the totality of possible 
information relevant to evaluating the effectiveness of z as a meas- 
ure of ability. 
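As a small illustration of equation (2), the sketch below converts a distribution over the (n + 1)th item into a distribution over the score x. The function name `score_distribution` and the equal probabilities in the example are assumptions made only for this illustration.

```python
def score_distribution(p_next, n):
    """Map {j: p_{n+1}(j | theta)} to {x: f(x | theta)} for an n-item test."""
    f = {}
    for j, prob in p_next.items():
        if j > 0:
            x = float(j)          # last item right: r = j and x = r
        else:
            x = n + j + 0.5       # last item wrong: r = n + j and x = r + 1/2
        f[x] = f.get(x, 0.0) + prob
    return f


# Two-item test in which all four answer patterns are equally likely.
# Patterns RR, RW, WR, WW lead to next items +2, -1, +1, -2 respectively.
print(score_distribution({2: 0.25, -1: 0.25, 1: 0.25, -2: 0.25}, n=2))
```

In this toy case the two patterns with one right answer receive different scores (1 and 1.5), which is exactly the half-point distinction the scoring rule introduces.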


Evaluating a Flexilevel Testing Procedure 


If we are to use x as a measure of ability, we would like $\mu_{x \mid \theta_1}$
(the mean of x when $\theta = \theta_1$) to differ from $\mu_{x \mid \theta_2}$ whenever $\theta_1 \ne \theta_2$.
It seems natural to use the “critical ratio”

$$\frac{\mu_{x \mid \theta_2} - \mu_{x \mid \theta_1}}{\sigma_{x \mid \theta_1}}$$

to summarize the effectiveness of x for discriminating between
ability levels $\theta_1$ and $\theta_2 = \theta_1 + \Delta$, where $\sigma_{x \mid \theta}$ is the conditional standard
deviation of x and $\Delta$ represents a small increment in ability (small
enough so that $\sigma_{x \mid \theta} = \sigma_{x \mid \theta + \Delta}$ approximately).
Actually we will work with the square of this ratio:

$$I_x(\theta) = k\, \frac{\left(\mu_{x \mid \theta + \Delta} - \mu_{x \mid \theta}\right)^2}{\sigma_{x \mid \theta}^2}, \tag{3}$$

where k is any convenient constant. Given some small increment $\Delta$,
$I_x(\theta)$, as a function of $\theta$, is readily computed from (2) for any
specified test design. Since we are only interested in comparisons
between designs, the values of k and $\Delta$ are of no importance so long
as they are the same for all designs compared.
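Given the conditional score distributions at two nearby ability levels, equation (3) reduces to a difference of means over a variance. The sketch below assumes each distribution is supplied as a dictionary from score to probability; the names `mean_and_variance` and `information` are illustrative only.

```python
def mean_and_variance(dist):
    """Mean and variance of a discrete distribution {score: probability}."""
    mean = sum(x * p for x, p in dist.items())
    var = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, var


def information(dist_theta, dist_theta_plus_delta, k=1.0):
    """Squared critical ratio of equation (3), up to the constant k."""
    mean0, var0 = mean_and_variance(dist_theta)
    mean1, _ = mean_and_variance(dist_theta_plus_delta)
    return k * (mean1 - mean0) ** 2 / var0


# Toy one-item example: the chance of a right answer rises from .5 to .6.
print(information({0: 0.5, 1: 0.5}, {0: 0.4, 1: 0.6}))
```

Because k and the increment are arbitrary, only ratios of this quantity between two designs evaluated with the same k and increment are meaningful.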


Test Designs Studied


The numerical results reported here are obtained on the assumption
that $P_i$ is a normal ogive, possibly modified to accommodate
the effects of success due to guessing:

$$P_i \equiv P(\theta; a_i, b_i, c_i) = c_i + (1 - c_i) \int_{-\infty}^{a_i(\theta - b_i)} \varphi(t)\, dt, \tag{4}$$

where $\varphi(t)$ is the normal density function. The results would
presumably be about the same if $P_i$ had been assumed logistic rather
than normal ogive.

To keep matters simple, we will only consider tests in which all
items have the same discriminating power, a; also the same
pseudo-chance level, c. Results are presented here separately for c = 0 (no
guessing) and c = .2. The results are general for any value of a > 0,
since a can be absorbed into the unit of measurement chosen for the
ability scale (as will be noticed for the base line shown in the
figures).
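A short sketch of the item response function in equation (4) may make the roles of a, b, and c concrete. The helper names `normal_cdf` and `p_correct` are illustrative, and the numerical values are chosen only to show two properties: the curve sits halfway between c and 1 when ability equals difficulty, and a can be absorbed into the ability scale.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def p_correct(theta, a, b, c=0.0):
    """Normal ogive with lower asymptote c, as in equation (4)."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


# When theta equals b the probability is halfway between c and 1.
print(p_correct(0.0, a=0.5, b=0.0, c=0.0))
print(p_correct(0.0, a=0.5, b=0.0, c=0.2))

# Only a * (theta - b) matters: halving a while doubling theta and b
# leaves the probability unchanged.
print(p_correct(2.0, a=0.5, b=1.0, c=0.2) == p_correct(1.0, a=1.0, b=0.5, c=0.2))
```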

In all tests studied, each examinee answers exactly n = 60 items.
For simplicity, we will consider only tests in which the item
difficulties form an arithmetic sequence, so that $b_{i+1} - b_i = d$, say.


Results for Tests with No Guessing 


Figure 1 compares the effectiveness of four 60-item (n = 60,
N = 119) flexilevel tests with each other and with three bench
mark tests. The scale chosen for $\theta$ in the figures is such that for
typical achievement and aptitude tests the standard deviation of
$\theta$ in typical high school and college groups will be very roughly
$\sigma_\theta = 1/2a$ (a more detailed explanation is given in Lord, 1971a).

The “standard test” is a conventional 60-item test composed
entirely of items of difficulty b = 0, scored by counting the number
of right answers. There is no guessing, so c = 0. The values of a and
c are the same for bench mark and flexilevel tests. For fixed a and
c, no test composed of dichotomously scored items with
characteristic curves (4) can have a higher value of $I_x(\theta)$ at any $\theta$
than the standard test has at $\theta = b_0$ (see Birnbaum, 1968).

As would be expected, the figure shows that the standard test is
best for discriminating among examinees at ability levels near
$\theta = 0$. If good discrimination is important at $\theta = \pm 2/2a$ or
$\theta = \pm 3/2a$, then a flexilevel test such as the one with d = .033/2a or
d = .050/2a is better. The larger d is, the poorer the measurement
at $\theta = b_0$, but the better the measurement at extreme values of $\theta$.

Suppose the best possible measurement is required at $\theta = \pm 2$,
with a = 0.5. It might be thought that an effective conventional
60-item test for this limited purpose would consist of 30 items at
b = +2 and 30 items at b = -2. The curve for this last test is
shown in Figure 1. The fact is that with a = 0.5, no unpeaked test
(i.e., no test with items at more than one difficulty level) can
simultaneously measure as well at both $\theta = +2$ and $\theta = -2$ as
does the standard test (which has all items peaked at b = 0).

[Figure 1. Relative efficiency of four 60-item flexilevel tests with $b_0 = 0$
(curves with d's) and three bench mark tests. c = 0. Legible curve labels:
standard; d = .033/2a; d = .050/2a; half at b = -2/2a, half at b = +2/2a;
half at b = -2.8/2a, half at b = +2.8/2a. Horizontal axis in units of 1/2a.]

The situation is different if the best possible measurement is
required at $\theta = \pm 3$, with a = 0.5. Using dichotomously scored items,
the best 60-item conventional test for this purpose consists of 30
items at b = -2.8 and 30 items at b = +2.8, approximately. The
curve for this test is shown in Figure 1.

For fixed $\theta$, the number-right score x on a standard test has a
binomial distribution. Thus, the expected score is

$$\mu_{x \mid \theta} = nP$$

and the variance of the scores is

$$\sigma_{x \mid \theta}^2 = nPQ,$$

where $P \equiv P(\theta)$ is given by (4). It is apparent from (3) that
$I_x(\theta)$ for a standard test is proportional to n, the test length.
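The proportionality can be written out explicitly. The following worked step is a sketch that simply substitutes the binomial mean and variance into equation (3); the notation $P(\theta + \Delta)$ for the item response function evaluated at the incremented ability is introduced here for convenience.

```latex
I_x(\theta)
  = k\,\frac{\bigl[\,nP(\theta+\Delta) - nP(\theta)\,\bigr]^{2}}{nP(\theta)\,Q(\theta)}
  = n \cdot k\,\frac{\bigl[\,P(\theta+\Delta) - P(\theta)\,\bigr]^{2}}{P(\theta)\,Q(\theta)} .
```

Since the second factor does not depend on n, the ratio of any test's $I_x(\theta)$ to that of a 60-item standard test, multiplied by 60, can be read as the length of the standard test that would measure equally well at that ability level. This is presumably the sense in which flexilevel tests are equated to standard tests of a stated length.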

We now see that when a = 0.5, the 60-item flexilevel test with
d = .033 gives about as effective measurement as a

58-item standard test at $\theta = 0$,

60-item standard test at $\theta = \pm 1$,

69-item standard test at $\theta = \pm 2$,

86-item standard test at $\theta = \pm 3$.

At $\theta = \pm 3$, the 60-item flexilevel test with d = .1 is as effective as
a 96-item standard test.


Results for Tests with Guessing 


Figure 2 compares the effectiveness of three 60-item flexilevel
tests with each other and with five bench mark tests. All items have
c = 0.2 and all have the same discriminating power a. The standard
test is a conventional 60-item test with all items at difficulty level
b = 0.5/2a, scored by counting the number of right answers.

If all the item difficulties in any test were changed by some
constant amount $\Delta b$, the effect would be simply to translate the
corresponding curve by an amount $\Delta b$ along the $\theta$-axis. The difficulty
level of each bench mark test and the starting item difficulty level
$b_0$ of each flexilevel test in Figure 2 has been chosen so as to give
maximum discriminating power somewhere in the neighborhood of
$\theta = 0$.

The standard test is again found to be best for discriminating
among examinees at ability levels near $\theta = 0$. At $\theta = \pm 2$ the
flexilevel tests are better than the standard test, which in turn seems
to be better than any of the other conventional (bench mark) tests,
although the situation is less clear than before because of the
asymmetry of the curves.

[Figure 2. Relative efficiency of three 60-item flexilevel tests (curves with
d's) and five bench mark tests. c = 0.2. (Numerical labels on curves are for
a = 0.5.) Legible curve labels: standard; $b_0 = -.9$, d = .033; half at
b = -2.0, half at b = 1.0. Vertical axis: $I_x(\theta)$; horizontal axis: $\theta$ from
-3/2a to +3/2a.]


When a = 0.5 the 60-item flexilevel test with $b_0 = -0.9$ and
d = .033 gives about as effective measurement as a

58-item standard test at $\theta = 0$,
60-item standard test at $\theta = +1$,
70-item standard test at $\theta = -2.0$ or $\theta = +2.25$,
83-item standard test at $\theta = +3$,
114-item standard test at $\theta = -3$.

At $\theta = -3$, the 60-item flexilevel test with $b_0 = -1.3$ and d = .067
is as effective as a 137-item standard test.


Conclusion 


Near the middle of the ability range for which the test is designed,
a flexilevel test is less effective than is a comparable peaked
conventional test. In the outlying half of the ability range, the flexilevel
test provides more accurate measurement in typical aptitude and
achievement testing situations than a peaked conventional test
composed of comparable items. This comparison assumes that 60 items
are administered to each examinee. The advantage of flexilevel tests
over conventional tests at low ability levels is significantly greater
when there is guessing than when there is not.

Empirical studies will be needed to answer such questions as the 
following: 

1. To what extent are different types of examinees confused by
flexilevel testing?

2. To what extent does flexilevel testing lose efficiency because of 
an increase in testing time per item? 

3. How adequately can we score the examinee who does not have 
time to finish the test? 

4. How can we score the examinee who does not follow directions? 

5. What other serious inconveniences and complications are there 
in flexilevel testing? 

6. Are the examinee's attitude and performance improved when
a flexilevel test “tailors” the test difficulty level to match his
ability level?

Empirical investigations should study tests designed in accordance
with the theory used here. Otherwise, it is likely that a poor choice
of d and especially of $b_0$ will result in an ineffective measuring
instrument.

The most likely application of flexilevel tests is in situations where 
it would otherwise be necessary to unpeak a conventional test in 
an attempt to obtain adequate measurement at the extremes of the 
ability range. Such situations are found in nationwide college 
admissions testing and elsewhere.


A SHORT CUT TOWARD A SUBMATRIX CONTAINING
ONLY "DISTURBED" INDIVIDUALS¹ ²


LOUIS L. McQUITTY 
University of Miami, Coral Gables 


Two recent studies found “normal” individuals relatively like 
one another and “disturbed” individuals relatively unlike one 
another, with intermediate degrees of association between members 
from each of these two categories. 

The above differences express themselves most clearly when the 
comparisons are restricted to the relatively large and relatively 
small indices between individuals. A computer program was devel- 
oped for comparing every submatrix with every other submatrix 
and a criterion was developed, applied, and found helpful for deter- 
mining which pair would most likely yield the best separation of 
“normal” and “disturbed” individuals (McQuitty, Banks, and
Frary, 1970; McQuitty, Banks, Frary, and Aye, 1972).


Hypothesis 


The above approach is so elaborate that it can analyze only 
small matrices (approximately 16 X 16). A simple way is here 
developed and illustrated for continuing to divide and redivide a
matrix of interassociations between “normal” and “disturbed”
subjects into submatrices which are hypothesized to contain only
“normals” in one submatrix and only “disturbed” subjects in the other
submatrix.


1 This investigation was supported by Public Health Service Research Grant
No. MH 14070-03 from National Institute of Mental Health.
2 Appreciation is expressed to Elizabeth Ann McQuitty, who performed
of the calculations by pencil and paper on the small sample of data (except
the agreement scores of the original matrix). The calculations were later
confirmed by the development of a computer program.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
First Study—Small Sample 


The Data 


The method is illustrated with the data of Table 1 which reports 
agreement scores between eight “normal” and eight “disturbed” 
subjects. 

Clinicians selected both categories of subjects from among those
seeking counseling from the Counseling Center at Michigan State
University during the academic year of 1966-67. The “disturbed”
subjects were chosen as in need of psychotherapy and the “normal”
subjects as not in need of psychotherapy.

Each subject completed a test illustrated by the following two
items: “The word mother suggests hope. yes no ?” and “The word
father suggests hate. yes no ?”

There were 13 concepts (1. control, 2. self, 3. marriage, 4. religion,
5. father, 6. achievement, 7. woman, 8. closeness, 9. distance, 10.
dependency, 11. sex, 12. man, and 13. friend) and ten emotions (1.
fear, 2. loneliness, 3. love, 4. hate, 5. guilt, 6. hope, 7. anxiety, 8.
anger, 9. frustration, and 10. depression). Every concept was
associated with every emotion as illustrated in the above two items to
yield a test of 130 items.

Any two subjects have an agreement on an item if and only if
both answer “yes,” both answer “no,” or both answer “?” on the item.
Their total agreement score is the number of items on which they
agree. The agreement scores of every subject with every other
subject are reported in the matrix of Table 1.
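The agreement score is simple enough to state as code. The sketch below is illustrative only: the function names and the three five-item response records are hypothetical, not data from the study.

```python
def agreement_score(responses_a, responses_b):
    """Number of items on which two subjects gave the same answer."""
    if len(responses_a) != len(responses_b):
        raise ValueError("both subjects must answer the same items")
    return sum(a == b for a, b in zip(responses_a, responses_b))


def agreement_matrix(responses):
    """Symmetric matrix of agreement scores; the diagonal is left as None."""
    size = len(responses)
    return [[None if i == j else agreement_score(responses[i], responses[j])
             for j in range(size)]
            for i in range(size)]


# Hypothetical records for three subjects on five items.
s1 = ["yes", "no", "?", "yes", "no"]
s2 = ["yes", "yes", "?", "no", "no"]
s3 = ["no", "no", "?", "yes", "yes"]
print(agreement_matrix([s1, s2, s3]))
```

With 130 items, each score lies between 0 and 130, and a matching “?” counts as an agreement just as a matching “yes” or “no” does.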


The Method 


The two largest and the two smallest entries of every column of
Table 1 are underlined, summed, and reported in Row a at the
bottom of the table. (Even in the case of a tie, only two of the
larger and only two of the smaller scores are underlined.) In the
general case about one-fourth to one-third of the entries in each
column are underlined, one-half of them the larger entries and the
other one-half of them the smaller entries.

The sums of Row a are ranked in Row b, with the largest sum
assigned a rank of one.


TABLE 1
A Matrix of Agreement Scores between “Normal” and “Disturbed” Subjects

[The body of Table 1 (classifications, code numbers, the 16 × 16 matrix of
agreement scores, Row a, and Row b) is not legible in this copy.]

N = "Normal"; D = "Disturbed"; Row a = sum of underlined entries; Row b = ranks of entries of Row a.

If d (the number of “disturbed” subjects) is greater than or
approximately equal to n (the number of “normals”), one-half of
the d subjects, those with the larger numerical ranks (smaller sums),
are withdrawn from the matrix and hypothesized to be “disturbed”
subjects. In the present case, these are Subjects 3, 4, 12, and 16, with
ranks of 16, 13, 14, and 15 respectively. All of them are in fact
“disturbed” subjects. If n is considerably greater than d, one-half
of the “normal” subjects are withdrawn first (as illustrated in
the next step of this example).

The remaining subjects are entered in a new matrix, in this 
case a 12 X 12 matrix, as shown in Table 2, using the agreement 
scores from Table 1. 

The same steps as applied above to Table 1 are now applied to 
Table 2, except that when the number of “normal” subjects is greater 
than the number of “disturbed” subjects, approximately one-half of 
the n subjects are withdrawn and predicted to be “normal.” The 
subjects with the smaller numerical ranks (larger agreement scores) 
are withdrawn. In this case they are Subjects 1, 2, 9, 13, and 11, 
with ranks of 4.5, 2.0, 3.0, 1.0, and 4.5. They are all in fact “nor- 
mal” except Subject 9 who is “disturbed.” 

Five subjects, rather than four (one-half of the n subjects) are 
withdrawn and predicted to be “normal” because the fourth and 
fifth subjects are tied with a rank of 4.5. 

The remaining subjects are assembled in a new matrix, Table 3. 
In this case, there are seven subjects. The same steps as outlined 
for Table 1 are applied to Table 3, except that d' subjects (the 
number of “disturbed” subjects still predicted to be in the matrix) 
are withdrawn and predicted to be “disturbed.” Only the sums of 
the largest and smallest scores in each column are used in the pre- 
dictions because these two scores represent one-third of the entries 
in each column. Since four of a total of eight subjects had already 
been predicted to be “disturbed,” this left d’ equal to four. The 
four subjects with the higher numerical ranks (smallest sums of 
agreement scores) are withdrawn and predicted to be “disturbed.” 
They are Subjects 10, 5, 7, and 15, with ranks of 4, 5, 6.5, and 6.5, 
respectively. All but Subject 15 are in fact “disturbed.” The
remaining three subjects are predicted to be “normal.” All of them are
in fact “normal.”
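The last step of the example can be replayed as a check on the procedure. The sketch below is not the author's program; the function names are illustrative, and the matrix is the seven-subject matrix of Table 3 as read from the scan, with illegible cells filled in from the table's symmetry and its reported Row a sums (an assumption). Diagonal cells are set to 0 and ignored.

```python
def column_sums(matrix, k):
    """Row a: per column, sum of the k smallest and k largest off-diagonal entries."""
    size = len(matrix)
    sums = []
    for col in range(size):
        entries = sorted(matrix[row][col] for row in range(size) if row != col)
        sums.append(sum(entries[:k]) + sum(entries[-k:]))
    return sums


def average_ranks(values):
    """Row b: rank 1 for the largest value; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda idx: -values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2.0
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


subjects = [6, 8, 10, 14, 5, 7, 15]
table3 = [
    [0, 90, 87, 84, 84, 87, 92],
    [90, 0, 96, 98, 93, 84, 79],
    [87, 96, 0, 86, 86, 76, 78],
    [84, 98, 86, 0, 92, 90, 79],
    [84, 93, 86, 92, 0, 86, 73],
    [87, 84, 76, 90, 86, 0, 75],
    [92, 79, 78, 79, 73, 75, 0],
]

row_a = column_sums(table3, k=1)          # largest plus smallest entry
row_b = average_ranks(row_a)
by_rank = sorted(zip(row_b, subjects), reverse=True)
predicted_disturbed = [subject for _, subject in by_rank[:4]]   # d' = 4
print(row_a, row_b, predicted_disturbed)
```

Ties that straddle the cutoff, as happened in the second step of the example, would need an explicit rule (the paper withdrew both tied subjects); this sketch does not handle that case.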


TABLE 3
Agreement Scores of Table 2 without those Subjects Predicted by Table 2
to be “Normal”

Classification     N    N    D    N    D    D    N
Code Number        6    8   10   14    5    7   15
6                      90   87   84   84   87   92
8                 90        96   98   93   84   79
10                87   96        86   86   76   78
14                84   98   86        92   90   79
5                 84   93   86   92        86   73
7                 87   84   76   90   86        75
15                92   79   78   79   73   75
Row a            176  177  172  177  166  165  165
Row b              3  1.5    4  1.5    5  6.5  6.5

N = "Normal"; D = "Disturbed"; Row a = sum of underlined entries; Row b = ranks of
entries of Row a.


Results

Of the eight subjects predicted to be “disturbed,” seven
are “disturbed” as evaluated by clinicians, and of the eight
subjects predicted to be “normal,” seven are “normal” as
evaluated by clinicians. These results are identical, subject for
subject, with those obtained by the more elaborate method (McQuitty,
Banks, Frary, and Aye, 1972). Eighty-eight per cent of the subjects
were correctly classified to yield a phi coefficient of 0.75 and a
chi-square of 9.00 with a significance of .0016 on a one-tailed test.
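The reported phi coefficient and chi-square follow directly from the fourfold table of predictions against clinicians' classifications. The sketch below assumes the ordinary Pearson chi-square without continuity correction, which reproduces the reported values; the function name is illustrative.

```python
from math import sqrt


def phi_and_chi_square(a, b, c, d):
    """Fourfold table [[a, b], [c, d]]: rows = predicted, columns = clinician.

    Returns (phi, chi-square), with chi-square = N * phi**2.
    """
    n = a + b + c + d
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denominator
    return phi, n * phi ** 2


# Small sample: 7 of 8 predicted "disturbed" and 7 of 8 predicted "normal"
# agree with the clinicians.
phi, chi_square = phi_and_chi_square(7, 1, 1, 7)
print(phi, chi_square)
```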


Interpretation 


Process of comparing every submatrix of d subjects (the number of 
“disturbed” subjects in a matrix) with the other submatrix of n 
subjects (the number of “normal” subjects in the matrix) to obtain 
that pair of submatrices which gives the best differentiation be- 


Its apparent success with that data might depend in part on 
“chance.” The method is, therefore, applied below to a larger and 
independent set of data. 

Second Study—Large Sample 
Theory 


The theory maintains that psychological disturbance expresses
itself in emotional components of intra-individual concepts.
Examples of intra-individual concepts are: (1) my behavior, (2) my
heart, (3) my reputation, and (4) my attitude. Each such concept
is presumed to have emotional flavors which differ from individual
to individual. Psychological disturbance is presumed to express
itself in many kinds of interrelationships among these emotional
flavors and often in restricted and unique patterns. Psychologically
disturbed individuals are thought to relate uniquely to both
“normals” and other “disturbed” individuals in these emotional flavors.


The Test 


The above emotional flavors can presumably be tapped by such 
test items as the following ones: 


1. My behavior suggests hope. yes no ?
2. My heart suggests love. yes no ?
3. My reputation suggests guilt. yes no ?
4. My attitude suggests sadness. yes no ?


The test derived from 14 intra-individual concepts and 14
emotions, every emotion associated with every concept in the fashion
illustrated above to yield 196 items. The emotions and concepts
are as follows: (1) hope, (2) love, (3) guilt, (4) joy, (5) sadness,
(6) hate, (7) sympathy, (8) fear, (9) anger, (10) pride, (11)
anxiety, (12) happiness, (13) respect, (14) distrust, and (1) my
behavior, (2) my heart, (3) my reputation, (4) my attitude, (5)
my soul, (6) my past, (7) my beliefs, (8) my conscience, (9) my
religion, (10) myself, (11) my future, (12) my state of mind, (13)
my feelings, and (14) my body. No statistical analysis has yet
been applied in an effort to improve the test.


The Subjects


The subjects were 144 undergraduate students from the
University of Miami, Coral Gables, Florida, who sought counseling
voluntarily at the University Counseling Center between September
13 and December 20, 1969. The test was administered before a
screening interview for the purpose of gathering routine information
and assigning the subject to a clinical or vocational counselor.

After having been seen at least twice by a counselor, every
subject was classified by his counselor as “disturbed” or
“nondisturbed.” In this context a “disturbed” subject was one who, in the
opinion of the counselor, was experiencing a serious behavioral
problem as a result of mental disturbance. Such problems ranged
from those associated with the more acute forms of mental illness
to neurotic reactions which inhibited academic performance or
social interaction in a serious manner. A subject was not classified
as “disturbed” if his problem was not judged to have a substantial
effect on his overall performance as a student or his relationships
with others.

As a result of the above approach, 66 of the subjects were classified
by counselors as “disturbed” and 78 were classified as
“nondisturbed” (i.e., “normal”). Every subject was classified by only one of
six counselors. Table 4 shows the classifications of the subjects by
the counselors.


The Analysis 


The agreement score was computed for every subject with
every other subject to yield a 144 × 144 matrix. It was analyzed
seven times by the method described above, each time with the
statistical decision based on a different number of entries in each
column. In every case the decision was based on an equal number
of higher and lower entries in each column. The first time it was
based on four entries (the first and second largest entries plus the
first and second smallest entries), the second time on one-fourth
of the entries (the larger one-eighth plus the smaller one-eighth),
etc., for one-third, one-half, two-thirds, three-fourths, and all
entries.


TABLE 4
Classification of Subjects by Counselors

                      Classifications
Counselor   Code        N       D
1           A           8      12
2           B          25       8
3           C          13      15
4           D          14       8
5           E          13      22
6           F           5       1
Totals                 78      66

N = “Nondisturbed” subjects; D = “Disturbed” subjects.


Results 


Table 5 reports the percentages of correct selections when (a)
one-half of the number of “disturbed” (33) were selected, (b)
one-half of the number of “normals” (39) were selected, (c) 66 were
selected as “disturbed,” and (d) 78 were selected as “normal,”
together with the phi coefficients, chi squares, and levels of significance
for the chi square using a one-tailed test. The latter category of
statistics is reported twice, once when half of the selections had
been completed and again at the end of the selections.

The results obtained with one-fourth of the entries of every
column are representative of or only slightly superior to those
obtained when other numbers of entries were used. These results are
summarized here. The first two selections (which sought one-half of both
the “disturbed” and the “normals”) produced 78.8% and 79.5%,
respectively, of correct classification as compared with 71.2% and
75.6%, respectively, for all subjects of these two categories and
73.6% for the two categories combined; a phi coefficient of .582
for one-half as compared with one of .469 for all subjects; and
corresponding levels of significance for chi square of .000020 and
.0000038.


Interpretation


The results are unusually encouraging; equal degrees of
differentiation based on objective analysis of data from unimproved tests
responded to by two highly similar categories of subjects have
infrequently, if ever before, been achieved in this area. The method is
of such a nature that it is not directed to taking advantage of
chance errors. Cross validation, for the usual reasons, is not
required. However, because of the unusual degree of differentiation,
follow-up studies are desirable; very atypical results are possible
by chance alone.

The findings show that certain “disturbed” individuals are in
general relatively unlike one another and unlike “normals” to such an
extent on certain psychological tests that this fact can be used to
differentiate between “normal” and “disturbed” subjects in unusually high
agreement with clinicians.

Two other methods based on the same or similar hypothesized
relationships have been described elsewhere using the concepts of
Spots” (McQuitty and Frary, 1971) and “scoring matrices.”


Summary


If a matrix of interassociations between individuals is known to
contain d “disturbed” and n “normal” subjects, it can be divided
into all possible pairs of submatrices of size d and n and a criterion
can be applied to determine the pair yielding the best classification
of “normal” and “disturbed” subjects (McQuitty, Banks, Frary,
and Aye, 1972). The present paper develops and illustrates a method
many times shorter and finds it to be more effective with an
unimproved test than any other objective method reported (except the
above mentioned long method) in the analysis of psychological tests
for differentiating “disturbed” from “normal” college students.


RELIABILITY OF MULTIPLE-CHOICE TESTS IS THE
PROPORTION OF VARIANCE WHICH IS TRUE VARIANCE


EDWARD E. CURETON 
University of Tennessee 


IN a recent issue of this journal, Frary (1969) presents an
analysis which seems to show that classical weak true-score theory
does not apply to multiple-choice tests. Starting with the equation,

$$x = t + g + e, \tag{1}$$

he shows that the variance ratios $\sigma_{t+g}^2/\sigma_x^2$ and $\sigma_t^2/\sigma_x^2$, the
correlation between equivalent forms, and the square of the correlation
between raw scores and true scores are all different.
The difficulty with this derivation is that the guessing score g is
not separated into a true component and an error component.
Guessing tendency is a real trait on which individuals differ
(e.g., Swineford, 1938, 1941; Ziller, 1957). But the actual amount
of guessing which an examinee does on a particular form of a test
depends also on the form's explicit content (as compared with other
equivalent forms of the same test), and on the specific time and
circumstances under which it is administered. The guessing-tendency
behavior has limited reliability, and varies about the true
trait score with changes in test form and in occasions on which the
test is given.

The content error e varies from form to form and from time to
time, and so also does the guessing error. Let $x_1$ and $x_2$ be the raw
scores on two forms of a test, t the content true score, g the
guessing-tendency true score, $e_1$ and $e_2$ the content errors of
measurement, and $\delta_1$ and $\delta_2$ the guessing-tendency errors of measurement.
In place of (1) we then have

$$
\begin{aligned}
x_1 &= t + g + e_1 + \delta_1,\\
x_2 &= t + g + e_2 + \delta_2.
\end{aligned}
\tag{2}
$$

If the two forms are administered simultaneously (as, e.g., the
odd and even items of one test), the errors will be form-associated
errors only. If they are administered at different times, each type
of error will include an occasion-associated error as well as a
form-associated error.

If the two forms are equivalent, it is assumed that t and g are
the same in both forms, that t and g are uncorrelated with $e_1$, $e_2$,
$\delta_1$, and $\delta_2$, that $e_1$ and $\delta_1$ are uncorrelated with $e_2$ and $\delta_2$, and that
the two forms are equally reliable and equally variable, so that
$\sigma_{x_1}^2 = \sigma_{x_2}^2 = \sigma_x^2$. It is not assumed that either t and g, or $e_1$ and
$\delta_1$, or $e_2$ and $\delta_2$ are uncorrelated. We, therefore, simply associate
the two true scores and the two error scores of each pair in
equations (2), so that

$$
\begin{aligned}
x_1 &= (t + g) + (e_1 + \delta_1),\\
x_2 &= (t + g) + (e_2 + \delta_2).
\end{aligned}
\tag{3}
$$

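By the variance-ratio definition of reliability and the assumption
of equal variance, we then have from either of equations (3),

$$r_{xx} = \frac{\sigma_{t+g}^2}{\sigma_x^2}. \tag{4}$$

The correlation of $x_1$ with $x_2$ is

$$
r_{12} = \frac{\sum (t+g)^2 + \sum (t+g)(e_1+\delta_1) + \sum (t+g)(e_2+\delta_2) + \sum (e_1+\delta_1)(e_2+\delta_2)}{N \sigma_1 \sigma_2}.
$$

Under the equivalence assumptions the last three terms in the
numerator vanish, $\sigma_1 \sigma_2 = \sigma_x^2$, and

$$r_{12} = \frac{\sum (t+g)^2}{N \sigma_x^2} = \frac{\sigma_{t+g}^2}{\sigma_x^2}, \tag{5}$$

which is the same as (4).

The correlation of $x_1$ with (t + g) is

$$
r_{1(t+g)} = \frac{\sum (t+g)^2 + \sum (t+g)(e_1+\delta_1)}{N \sigma_1 \sigma_{t+g}}.
$$

The second term in the numerator vanishes, $\sigma_1 = \sigma_x$, and

$$r_{1(t+g)} = \frac{\sigma_{t+g}^2}{\sigma_x \sigma_{t+g}} = \frac{\sigma_{t+g}}{\sigma_x},$$

which by (5) is $\sqrt{r_{12}}$. We would arrive at the same result for $r_{2(t+g)}$.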

As compared with free-answer tests, these results differ only in
that the true score is the true content score plus the true
guessing-tendency score. If the correction for guessing is not used we are
simply measuring, with some error, a composite of the content
knowledge or ability and the guessing-tendency trait.
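A small simulation can illustrate the claim that, once the guessing score is split into a true part and an error part, the correlation between equivalent forms equals the proportion of variance due to the composite true score, and the correlation of a raw score with that composite is its square root. Everything numerical in the sketch is an assumption chosen for illustration (normal scores, the standard deviations, and a deliberately correlated t and g); it is not taken from the paper.

```python
import random
from math import sqrt


def pearson(u, v):
    """Product-moment correlation of two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def simulate(n_examinees=20000, seed=0):
    """Return (r_12, r_1(t+g)) for two simulated equivalent forms."""
    rng = random.Random(seed)
    composite, x1, x2 = [], [], []
    for _ in range(n_examinees):
        t = rng.gauss(0.0, 3.0)                  # content true score
        g = 0.3 * t + rng.gauss(0.0, 1.0)        # guessing-tendency true score
        true_part = t + g
        composite.append(true_part)
        # Each form adds its own content error and guessing error.
        x1.append(true_part + rng.gauss(0.0, 1.5) + rng.gauss(0.0, 1.0))
        x2.append(true_part + rng.gauss(0.0, 1.5) + rng.gauss(0.0, 1.0))
    return pearson(x1, x2), pearson(x1, composite)


r12, r1_true = simulate()
print(r12, r1_true ** 2)
```

Under these assumed variances the composite true variance is 16.21 and the total variance 19.46, so both printed values should lie close to 16.21/19.46, apart from sampling error.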

When the correction for guessing is used, most of the systematic
error represented by g in (2) and (3) is removed, but the guessing
errors, $\delta_1$ and $\delta_2$, remain: they are intrinsic to measurement with
multiple-choice tests.

Instructions to limit guessing are peculiarly insidious. Partial 
knowledge of an item is real and substantial, and its use in answer- 
ing multiple-choice tests always involves guessing. If an examinee 
can eliminate one or two wrong alternatives, and guesses among the 
remainder, the odds are in his favor. If he has a hunch, he should 
play it: hunches are right with frequency greater than chance. 
Under guess-limiting instructions, examinees whose guessing ten- 
dencies are high ignore them and receive additional credit for par- 
tial knowledge, while examinees with low guessing tendencies 
heed them and receive no such credit. When the correction formula 
is used, partial knowledge is credited on the average (Cureton, 
1966), but to permit this, examinees must be instructed emphati- 
cally to omit an item only if an answer would be a pure guess. 

If every examinee is required to mark every item, the g-variance
is reduced to zero at the expense of an increase in the $\delta$-variance.


THE PROBABILITY OF MISCLASSIFICATION OF
STUDENTS ON MULTIPLE CHOICE EXAMINATIONS¹


WALTER H. CARTER, JR.²


Medical College of Virginia 
Health Sciences Center 
Virginia Commonwealth University 


THERE is now extensive literature on methods of obtaining a
student's true score on a multiple choice examination (Calandra,
1941; Chernoff, 1962; Hamilton, 1950; Lord, 1953; Lyerly, 1951).
Since, in many respects, such an examination is equivalent to
random sampling, a majority of these procedures estimate this quantity
statistically. This estimate is then used to classify the students in
groups, such as honors, pass, fail, or A, B, C, D, F. However, since
this grouping is based on a particular value of the estimator, itself
a random variable, any such classification of students will be
subject to error. Very little research has been devoted to determining
the probability of misclassifying students as a result of using the
various methods of estimating true scores.

In this paper the probability of the misclassification of students is
developed by a method which is independent of the particular
procedure used to determine the aforementioned groups. To obtain this
quantity it is necessary to assume the existence of a set of questions
which will accurately measure a student’s ability to answer
questions for which he does not know the correct answer.

1 This work was supported by a National Institutes of Health Institutional
Grant (5P07FR00016).
2 I am indebted to Mrs. Lillian Kornhaber, Department of Biometry,
Medical College of Virginia, for the computational assistance she provided
in the preparation of the tables which appear in this article.

Krutchkoff (1967) has defined the separation level of grades, e.g.,
A, B, C, D, F, as the probability that a student with a higher grade
actually knew the answers to more questions than the student with
a lower grade in an attempt to justify the use of multiple choice
examinations. This would appear, at first glance, to be related to a
probability of misclassification and indeed a simple function of it
shall be used as an approximation. To arrive at an expression for
the separation level of grades Krutchkoff has made two
assumptions:

1. Partial knowledge plays no role in a student's guess at the 
answer to a question for which he does not know the correct 
response. 

2. The class of students taking the examination is homogeneous. 

That the first assumption is too restrictive can be seen from the 
following hypothetical example. Consider a student who does not 
know the correct answer to a given question which contains five 
possible answers. As a result of partial knowledge of the subject
matter, he is able to eliminate two of the possible answers as incorrect.
Hence, this student is now able to guess the correct answer
with probability 1/3 instead of 1/5. For this reason, it will be assumed
that partial knowledge plays an important role in a student's response
to multiple choice questions.

In what follows, a probability of misclassification will be derived 
which is based on each student's partial knowledge of the subject 
matter. It should also be noted that this derivation does not require 
assumption two above. Krutchkoff's results will then be compared
to those obtained by the methods developed in this paper. To 
facilitate this comparison we adopt Krutchkoff’s notation. Let 

N = the total number of questions, each with r possible answers;

X = the proportion of subject matter known; 

W = the number of answers known; 

Y = the number of correct guesses; 

Z = the total number of correct answers; 

p = the probability of correctly guessing the answer to a question. 

The assumption is made here, as in Krutchkoff’s paper, that for 
each student the proportion of the subject matter known is almost 
normally distributed, that is, the proportion of the subject matter 
known follows a normal distribution truncated at zero and one. We 
choose to work with this truncation by concentrating the lower tail 
probability at zero and the upper tail probability at one. As a
result of this assumption it can be seen that NX is almost normally
distributed with parameters μ and σ. It is assumed that the
probability, p, of correctly guessing the answer to a question is a
random variable conditioned upon the student's knowledge of the
subject matter as measured by W, the number of answers known.
It is further assumed that the value of p for a given W follows a
Beta distribution with parameters α and β, where α = h(W) and
β = k(W). Both h(W) and k(W) denote unknown positive, real
functions of W. It will not be necessary to estimate the functional
form of these two functions, but for each student we shall arrive
at an estimate of the value of α and β. The estimation of these
parameters will be discussed later.


Theoretical Development 


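The conditional probability mass function of Z for given W and
p can be expressed as

$$P(Z \mid W, p) = \binom{N-W}{Z-W}\, p^{Z-W} (1-p)^{N-Z}, \qquad (1)$$

and from the definition of conditional probabilities we can write

$$P(Z \mid W) = \int_0^1 P(Z \mid W, p)\, dF(p \mid W). \qquad (2)$$

However, it has been assumed that

$$dF(p \mid W) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}\, dp,$$

where

$$B(\alpha, \beta) = \int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\, dp.$$

Hence,

$$P(Z \mid W) = \binom{N-W}{Z-W}\, \frac{B(Z-W+\alpha,\; N-Z+\beta)}{B(\alpha, \beta)}. \qquad (3)$$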
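The distribution in equation (3) is a Beta-binomial distribution on the N - W questions the student does not know. A minimal Python sketch of it follows; the function names and the input values (21 questions, 5 answers known, α = 20, β = 75) are illustrative assumptions, not part of the original analysis.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def p_z_given_w(z, w, n, alpha, beta):
    """P(Z = z | W = w) as in equation (3); zero unless w <= z <= n."""
    if z < w or z > n:
        return 0.0
    log_p = (log_choose(n - w, z - w)
             + log_beta(z - w + alpha, n - z + beta)
             - log_beta(alpha, beta))
    return exp(log_p)


# Hypothetical inputs: 21 questions, 5 answers known, alpha = 20, beta = 75.
probs = [p_z_given_w(z, 5, 21, 20.0, 75.0) for z in range(22)]
total = sum(probs)  # a proper distribution over z should sum to 1
```

Working on the log scale avoids overflow in the Beta functions when N is large.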


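From this distribution it is possible to obtain an expression for
the expected value of the total number of correct answers, E(Z), as
follows:

$$E(Z - W) = E\{E[(Z - W) \mid W]\} = E\left[\sum_{Z=W}^{N} (Z-W)\binom{N-W}{Z-W}\frac{B(Z-W+\alpha,\; N-Z+\beta)}{B(\alpha,\beta)}\right] = E\left[\frac{(N-W)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}\right]. \qquad (4)$$

If N is large and the truncated tails are small,

$$E(W) = \mu,$$

and

$$E(Z - W) = \frac{(N-\mu)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}. \qquad (5)$$

Together, the last two expressions imply

$$E(Z) = \frac{(N-\mu)\,B(\alpha+1,\beta)}{B(\alpha,\beta)} + \mu, \qquad (6)$$

or

$$\mu = \frac{E(Z) - N\,B(\alpha+1,\beta)/B(\alpha,\beta)}{1 - B(\alpha+1,\beta)/B(\alpha,\beta)}. \qquad (7)$$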
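Because B(α + 1, β)/B(α, β) simplifies to α/(α + β), the mean of the Beta distribution, the expected score and the estimate of μ obtained by inverting it are easy to compute. The sketch below illustrates this under assumed inputs; the function names and the mean score of 8.5 are hypothetical.

```python
def mean_guess_probability(alpha, beta):
    """B(alpha + 1, beta) / B(alpha, beta), which equals alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def expected_score(mu, n, alpha, beta):
    """E(Z) = (N - mu) * B(alpha + 1, beta) / B(alpha, beta) + mu."""
    c = mean_guess_probability(alpha, beta)
    return (n - mu) * c + mu


def estimate_mu(z_bar, n, alpha, beta):
    """Solve the expression for E(Z) for mu, with E(Z) replaced by an observed mean score."""
    c = mean_guess_probability(alpha, beta)
    return (z_bar - n * c) / (1.0 - c)


# Hypothetical inputs: 21 questions, alpha = 20, beta = 75, mean score 8.5.
mu_hat = estimate_mu(8.5, 21, 20.0, 75.0)
round_trip = expected_score(mu_hat, 21, 20.0, 75.0)  # should recover 8.5
```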


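As a result of the distributional assumption made for X, it is
easily seen that

$$P(W = w) = \begin{cases} \Phi\!\left(\dfrac{\tfrac12 - \mu}{\sigma}\right), & w = 0, \\[2ex] \Phi\!\left(\dfrac{w + \tfrac12 - \mu}{\sigma}\right) - \Phi\!\left(\dfrac{w - \tfrac12 - \mu}{\sigma}\right), & w = 1, 2, \ldots, N-1, \\[2ex] 1 - \Phi\!\left(\dfrac{N - \tfrac12 - \mu}{\sigma}\right), & w = N, \end{cases} \qquad (8)$$

where Φ(·) represents the value of the standard normal cumulative
distribution evaluated at (·).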

Making use of the definition of conditional probability once again,
we have

$$P(Z) = \sum_{W=0}^{Z} P(Z \mid W)\,P(W). \qquad (9)$$

An expression for P(Z | W), the probability that a particular
student gave the correct answer to Z questions when he knew the
answer to W questions, has been obtained in equation (3). A quantity
which will be very useful to us in calculating the probability of
misclassification is P(W | Z), the probability that a student knew
the answer to W questions when he gave the correct answer to Z
questions. It turns out that P(W | Z) can be obtained from P(Z | W)
and P(W) by means of Bayes' inversion rule, which yields

$$P(W \mid Z) = \frac{P(Z \mid W)\,P(W)}{\sum_{W=0}^{Z} P(Z \mid W)\,P(W)} = \frac{\binom{N-W}{Z-W} B(Z-W+\alpha,\; N-Z+\beta)\,P(W)}{\sum_{W=0}^{Z} \binom{N-W}{Z-W} B(Z-W+\alpha,\; N-Z+\beta)\,P(W)}. \qquad (10)$$

As a result of (10) the Probability of Misclassification, PMC,
can now be obtained as

$$\mathrm{PMC} = 1 - \sum_{i=1}^{Z_2} P(W_2 = i \mid Z_2) \sum_{j=0}^{i-1} P(W_1 = j \mid Z_1). \qquad (11)$$

The PMC is nothing more than the probability that, in a comparison
between two students, the student who gave the correct answer
to fewer questions, $Z_1$, actually knew the correct answers to as many
or more questions than the student who gave the correct answer to
$Z_2$ questions ($Z_2 > Z_1$), i.e.

$$\mathrm{PMC} = P[W_1 \ge W_2 \mid Z_2 > Z_1].$$
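The Bayes inversion and the double sum for the PMC can be sketched in Python as follows. This is only an illustration under stated assumptions: the prior for W is taken to be a normal distribution rounded to the nearest integer with the tails collected at 0 and N, and the values μ = 5, σ = 3, α = 20, β = 75 and N = 21 are hypothetical inputs chosen for the example.

```python
from math import erf, exp, lgamma, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def prior_w(n, mu, sigma):
    """ASSUMED prior P(W = w), w = 0..n: a rounded normal with tail mass at 0 and n."""
    cuts = [normal_cdf((w + 0.5 - mu) / sigma) for w in range(n)]
    probs = [cuts[0]]
    probs += [cuts[w] - cuts[w - 1] for w in range(1, n)]
    probs.append(1.0 - cuts[n - 1])
    return probs


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def posterior_w(z, n, alpha, beta, prior):
    """P(W = w | Z = z) for w = 0..z by Bayes' rule; B(alpha, beta) cancels."""
    weights = [exp(log_choose(n - w, z - w)
                   + log_beta(z - w + alpha, n - z + beta)) * prior[w]
               for w in range(z + 1)]
    total = sum(weights)
    return [x / total for x in weights]


def pmc(post_low, post_high):
    """P(W1 >= W2), where post_low belongs to the lower scorer (Z1)."""
    p_low_knew_less = 0.0
    cumulative = 0.0  # P(W1 <= i - 1) for the lower scorer
    for i, p_high in enumerate(post_high):
        p_low_knew_less += p_high * cumulative
        if i < len(post_low):
            cumulative += post_low[i]
    return 1.0 - p_low_knew_less


# Hypothetical inputs: N = 21, mu = 5, sigma = 3, alpha = 20, beta = 75.
prior = prior_w(21, 5.0, 3.0)
post_8 = posterior_w(8, 21, 20.0, 75.0, prior)
post_9 = posterior_w(9, 21, 20.0, 75.0, prior)
pmc_8_vs_9 = pmc(post_8, post_9)
```

Students with different estimates of α and β simply receive different posteriors, which is what allows students with the same score to be compared.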

Before the PMC can be calculated we must obtain an estimator
for σ, a parameter in the distribution of W. Since Z = W + Y,
Var Z, the variance of Z, can be expressed as

$$\mathrm{Var}\, Z = \mathrm{Var}(W + Y) = \mathrm{Var}\, W + \mathrm{Var}\, Y + 2\,\mathrm{Cov}(W, Y).$$

From the assumed distributional form of W it is known that

$$\mathrm{Var}\, W = \sigma^2.$$

In order to complete the expression for Var Z, it is necessary to
write

$$\mathrm{Var}\, Y = E[\mathrm{Var}(Y \mid W)] + \mathrm{Var}[E(Y \mid W)] \qquad (12)$$

and

$$\mathrm{Cov}(W, Y) = E\{E[(W - \mu)(Y - (N - \mu)E(p)) \mid W]\}.$$

Since the number of correct guesses equals the difference between
the total number of correct answers and the number of known answers,
it follows that

$$\mathrm{Var}[Y \mid W] = \mathrm{Var}[(Z - W) \mid W] = E[(Z - W)^2 \mid W] - E^2[(Z - W) \mid W].$$

Evaluating these expressions we find

$$E[(Z-W)^2 \mid W] = \sum_{Z=W}^{N} (Z-W)^2 \binom{N-W}{Z-W} \frac{B(Z-W+\alpha,\; N-Z+\beta)}{B(\alpha,\beta)} = \int_0^1 \left[\sum_{Z=W}^{N} (Z-W)^2 \binom{N-W}{Z-W} p^{Z-W}(1-p)^{N-Z}\right] \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}\, dp.$$
For the binomial distribution

$$E(Z - W)^2 = (N - W)p(1 - p) + (N - W)p$$

and hence

$$E[(Z - W)^2 \mid W] = \frac{(N-W)\,B(\alpha+1,\beta+1)}{B(\alpha,\beta)} + \frac{(N-W)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}. \qquad (13)$$


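It has been shown previously, equation 4, that

$$E[Y \mid W] = E[(Z - W) \mid W] = \frac{(N-W)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}.$$

Therefore

$$\mathrm{Var}[Y \mid W] = \mathrm{Var}[(Z-W) \mid W] = \frac{N-W}{B(\alpha,\beta)}\left[B(\alpha+1,\beta+1) + B(\alpha+1,\beta) - \frac{(N-W)\,B^2(\alpha+1,\beta)}{B(\alpha,\beta)}\right] \qquad (14)$$

and

$$E[\mathrm{Var}(Y \mid W)] = \frac{N-\mu}{B(\alpha,\beta)}\left[B(\alpha+1,\beta+1) + B(\alpha+1,\beta)\right] - \frac{\left[\sigma^2 + (N-\mu)^2\right] B^2(\alpha+1,\beta)}{B^2(\alpha,\beta)}. \qquad (15)$$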
To complete the expression for Var Y, we must calculate Var
[E(Y | W)]. Using the expression given in equation (4) for
E[Y | W] it can be shown that

$$\mathrm{Var}[E(Y \mid W)] = \mathrm{Var}\left[\frac{(N-W)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}\right] = \left[\frac{B(\alpha+1,\beta)}{B(\alpha,\beta)}\right]^2 \sigma^2. \qquad (16)$$

Combining equations (15) and (16) in the manner prescribed by
equation (12) yields

$$\mathrm{Var}\, Y = \frac{N-\mu}{B(\alpha,\beta)}\left[B(\alpha+1,\beta+1) + B(\alpha+1,\beta) - \frac{(N-\mu)\,B^2(\alpha+1,\beta)}{B(\alpha,\beta)}\right]. \qquad (17)$$


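We proceed next to find an expression for the covariance between
the number of answers known and the number of correct guesses,
Cov (W, Y). It follows from the definition of the covariance between
two random variables that

$$\begin{aligned}
\mathrm{Cov}(W, Y) &= E\{E[(W-\mu)(Y-(N-\mu)E(p)) \mid W]\} \\
&= E\left[(W-\mu)\,\frac{(N-W)\,B(\alpha+1,\beta) - (N-\mu)\,B(\alpha+1,\beta)}{B(\alpha,\beta)}\right] \\
&= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)}\,E[(W-\mu)(\mu-W)] \\
&= -\frac{\sigma^2\,B(\alpha+1,\beta)}{B(\alpha,\beta)}. \qquad (18)
\end{aligned}$$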
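Finally, we are able to write

$$\mathrm{Var}\, Z = \sigma^2 + \frac{N-\mu}{B(\alpha,\beta)}\left[B(\alpha+1,\beta+1) + B(\alpha+1,\beta) - \frac{(N-\mu)\,B^2(\alpha+1,\beta)}{B(\alpha,\beta)}\right] - \frac{2\,B(\alpha+1,\beta)\,\sigma^2}{B(\alpha,\beta)}, \qquad (19)$$

which implies

$$\sigma^2 = \frac{\mathrm{Var}\, Z - \dfrac{N-\mu}{B(\alpha,\beta)}\left[B(\alpha+1,\beta+1) + B(\alpha+1,\beta) - \dfrac{(N-\mu)\,B^2(\alpha+1,\beta)}{B(\alpha,\beta)}\right]}{1 - \dfrac{2\,B(\alpha+1,\beta)}{B(\alpha,\beta)}}. \qquad (20)$$
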

Before a value for the PMC can be calculated μ and σ must be
estimated. An estimate of μ can be obtained by replacing E(Z)
in equation (7) by $\bar{Z}$, and σ can be estimated by replacing Var Z
in equation (20) by $s_Z^2$, where

$$\bar{Z} = \frac{1}{k}\sum_{i=1}^{k} Z_i, \qquad (21)$$

$$s_Z^2 = \frac{1}{k-1}\sum_{i=1}^{k} (Z_i - \bar{Z})^2, \qquad (22)$$

and k equals the number of examinations given to determine a
student's grade. If k = 1, then the examination must be randomly
divided into $k_1 \ge 2$ different sub-examinations in order to apply
the methods developed in this section. A similar procedure, called the
split-half technique, is frequently employed when measuring the
reliability of an examination. See, for example, Rosinski and Hamilton
(1966).


Estimation of α and β


In arriving at the expression for the PMC nothing was said about
how to obtain values for α and β, the parameters which appear in
the distribution of a student's guessing ability. In this section we
shall use a method due to Weiler (1965) to estimate these parameters.
Since partial knowledge has been shown to play an important role
in guessing and since guessing only occurs on questions for which
the answer is unknown, it seems appropriate to include on the examination
several questions chosen so that the students would not
be expected to know the answer, but chosen in such a manner as to
allow a student's partial knowledge to help in arriving at the guessed
answer. The parameters of the underlying Beta distribution can then
be inferred from a student's performance on this sample of questions.
As a result of the examination it can be determined that
100P per cent of the time a student had a value of p above $p_1$ and
another 100P per cent of the time had a value below $p_2$. Hence, we
have the following system of equations:

$$\frac{1}{B(\alpha,\beta)} \int_{p_1}^{1} p^{\alpha-1}(1-p)^{\beta-1}\, dp = P,$$

$$\frac{1}{B(\alpha,\beta)} \int_{0}^{p_2} p^{\alpha-1}(1-p)^{\beta-1}\, dp = P. \qquad (23)$$

These two expressions can be solved simultaneously for the unknown
parameters α and β by use of Pearson's Tables of the Incomplete
Beta Function (1934).


A Single Examination 


The following example will illustrate the application of the proce- 
dure described in this paper to a single examination. Since the ex- 
tension to a series of examinations is immediate, it will not be dis- 
cussed further in this paper. The examination being analyzed is a 
section of a larger examination which was given to 127 first year 
medical students at the Medical College of Virginia in September 
1968. In order to obtain estimates of the parameters of the underly- 
ing Beta distribution, the examination was randomly divided into 
two parts such that on each part there was an approximately equal 
number of questions, five on the first part and four on the second 
part, designed to measure a student's partial knowledge. Based on
the students' performances, it was decided to pass those who answered
more than eight of twenty-one questions correctly. Since
there were five students who answered eight questions correctly and
three who answered nine questions correctly, it was of interest to
obtain the PMC for these students. These misclassification probabilities
and those calculated by Krutchkoff's method appear in
Table 1. Notice that Krutchkoff's method only permits the calculation
of a misclassification probability between grades, e.g. between those
students who correctly answered eight questions and those who
answered nine, as opposed to between students within a grade.

Since students will generally perform differently on the set or sets 
of questions designed to indicate their degree of partial knowledge, 
it is possible, by applying the methods developed here, to calculate 
the PMC for students who have correctly answered the same number
of questions. Thus, we now have a method for ranking such students.
The PMC has been calculated for the students who scored
eight and nine on the test and the results appear in Tables 2 and 3 
respectively. 


Conclusion 


Since students, for various reasons, do not possess the same levels
of partial knowledge, they guess the correct answer to questions not 
completely known with different frequencies. In the past either no 
attempt has been made to account for this effect or it has been as- 
sumed that all students, when faced with a question they do not 
know, guess the correct answer with equal probability, 1/r. In this 
paper it has been assumed that the probability with which a person 
guesses the correct answer to a multiple choice question is a random
variable that follows a Beta distribution with unknown parameters.
A method is given for estimating these parameters. A procedure,
using the estimates of these parameters, is developed to
obtain the probability of misclassifying students based on their 


TABLE 1*
Probabilities of Misclassifying Students who Correctly Answered Eight and Nine
Questions

                       9(5, 4, 20.0, 75.0)   9(7, 2, 20.0, 75.0)   9(7, 2, 4.8, 47.2)

8(6, 2, 4.8, 47.2)     0.839 2
8(3, 5, 20.0, 75.0)    0.410                 0.408                 0.047
8(4, 4, 20.0, 75.0)    0.448                 0.516                 0.063
8(5, 3, 20.0, 75.0)    0.410                 0.478                 0.047
8(7, 1, 20.0, 75.0)    0.477                 0.545                 0.077

Krutchkoff's separation level = 1 - PMC(8, 9) = 0.558; PMC(8, 9) = 0.442.
* The entries in Tables 1, 2, and 3 are the PMC's, i.e. in the (i, j) cell we have tabulated the
probability that student i actually knew as many or more correct answers than student j.
* 8(6, 2, 4.8, 47.2) denotes a student who answered eight questions correctly, six on one half of
the examination and two on the other half, with parameters α = 4.8 and β = 47.2 in the assumed Beta
distribution.
TABLE 3
The Probabilities Used to Rank Students who Correctly Answered Nine Questions

                       9(5, 4, 20.0, 75.0)   9(7, 2, 20.0, 75.0)   9(7, 2, 4.8, 47.2)

9(5, 4, 20.0, 75.0)                          0.636                 0.123
9(7, 2, 20.0, 75.0)                                                0.082


individual performances on multiple choice examinations. By means 
of illustration, it is further shown that this procedure can be used 
to rank students who gave the correct answer to the same number 
of questions. 
NONPARAMETRIC ITEM EVALUATION INDEX¹


STEPHEN H. IVENS 


College Entrance Examination Board 
Atlanta, Georgia 


The purpose of this paper is to develop a nonparametric index
for evaluating the effectiveness of dichotomously scored items
that takes into account both the difficulty level and the discrimination
of the item. Although Davis (1951) did not recommend combining
both item characteristics into a single index, Findley (1956)
defended the procedure. Indices such as the biserial r, point-biserial
r, and D (Findley, 1956) are dependent on item difficulty, while
the rank-biserial r developed by Cureton (1956) and extended by 
Glass (1965, 1966) is independent of item difficulties. The criterion 
upon which this new index is based is that the best possible item will 
have a difficulty of .5 and have perfect discrimination. The .5 diffi- 
culty level was chosen because this value maximizes the number of 
discriminations an item can make, and a test composed of such items 
will have the greatest overall validity and reliability (See Gulliksen, 
1945, 1950; Lord, 1952; Cronbach and Warrington, 1952). Ad- 
mittedly, however, there are instances when this criterion would not 
be appropriate for item selection. 

¹ The original suggestion for this index came from T. S. Briley, Florida
State University. The author is indebted to J. M. Laible, Eastern Illinois
University, and to J. K. Brewer, Florida State University, for many helpful
suggestions given on the original draft of this paper.

Consider item i as one of k items in a test administered to N
individuals. If these N individuals are ranked, either in terms of
their total score or an outside criterion (with tied ranks randomly
broken), from low to high, then the response vector $X_i$ for item
i would be

$$X_i = [a_1, a_2, \ldots, a_N], \qquad (1)$$

where

$$a_j = \begin{cases} 1 & \text{if the } j\text{th individual passed the item,} \\ 0 & \text{if the } j\text{th individual failed the item.} \end{cases}$$

Since the individuals were ranked from low to high, a corresponding
vector of ranks, denoted $R_N$, would be

$$R_N = [1, 2, 3, \ldots, N]. \qquad (2)$$

The scalar product $X_i \cdot R_N = Y_i$ is the rank sum of those individuals
who passed item i, namely,

$$Y_i = X_i \cdot R_N = \sum_{j=1}^{N} j\,a_j. \qquad (3)$$

Let us define $n_0$ and $n_1$ as the number of individuals who failed
and passed item i respectively, such that

$$N = n_0 + n_1, \qquad (4)$$

and $\bar{R}_N$ as the median rank of vector $R_N$,

$$\bar{R}_N = (N + 1)/2. \qquad (5)$$

An index, $W_i$, can now be defined as

$$W_i = Y_i + n_0 \bar{R}_N, \qquad (6)$$

which will increase in value as the discrimination of the item increases
and as the difficulty of the item approaches .5.
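The index defined above is straightforward to compute from a 0/1 response vector ordered from the lowest-ranked to the highest-ranked examinee. The following Python sketch is illustrative only; the function name and the six-person example vectors are assumptions made for the demonstration.

```python
from itertools import product


def w_index(responses):
    """W = Y + n0 * (N + 1) / 2 for a 0/1 vector ordered from lowest to highest rank."""
    n = len(responses)
    y = sum(rank * a for rank, a in enumerate(responses, start=1))  # rank sum of passers
    n0 = n - sum(responses)                                         # number who failed
    return y + n0 * (n + 1) / 2


# Hypothetical six-person examples.
best = w_index([0, 0, 0, 1, 1, 1])    # perfect discrimination, difficulty .5
worst = w_index([1, 1, 1, 0, 0, 0])   # perfectly reversed discrimination
all_values = [w_index(v) for v in product((0, 1), repeat=6)]
mean_w = sum(all_values) / len(all_values)  # average over all 2**6 response vectors
```

Enumerating every response vector is feasible only for small N, but it gives a direct check on the mean and on the bounds of the index.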

There are a total of $2^N$ equally likely response vectors, $X_i$, as $n_0$
takes on the values from 0 to N. Thus, the expected value of $W_i$,
denoted $\bar{W}$, is derived by summing equation 6 over all possible
response vectors and dividing by $2^N$. Hence:

$$\bar{W} = \frac{1}{2^N} \sum \left(Y_i + n_0 \bar{R}_N\right). \qquad (7)$$

Since

$$\sum X_i \cdot R_N = \sum_{n_0=0}^{N} \binom{N}{n_0}\, n_0\, \bar{R}_N = N(N+1)\,2^{N-2}, \qquad (8)$$

equation 7 reduces to

$$\bar{W} = \frac{N(N+1)}{2}. \qquad (9)$$


The use of $W_i$ as an item index is not meaningful, however, unless
we know the maximum and minimum values that $W_i$ can
achieve. These values can be obtained from the following inequalities:

$$\frac{N(N+1)}{2} - \left[N + (N-1) + \cdots + (N - (n_0 - 1))\right] \le \sum_{j=1}^{N} j\,a_j$$

and

$$\sum_{j=1}^{N} j\,a_j \le \frac{N(N+1)}{2} - (1 + 2 + \cdots + n_0),$$

which hold for each value of $n_0$ since $\sum_{j=1}^{N} j\,a_j = N(N+1)/2$
if $a_j = 1$ for each j. Upon simplification the above inequalities become

$$\frac{N(N+1) - n_0(2N - n_0 + 1)}{2} \le Y_i \le \frac{N(N+1) - n_0(n_0+1)}{2}.$$

Substituting from equation 6 yields

$$\frac{N^2 + N - n_0(N - n_0)}{2} \le W_i \le \frac{N^2 + N + n_0(N - n_0)}{2}.$$

When N is even, the maximum value for the upper bound of $W_i$ is

$$W_{\max} = \frac{5N^2 + 4N}{8}. \qquad (10)$$

This maximum is achieved when $n_0 = N/2$ where the first N/2 $a_j$'s
are zero. The minimum value for the lower bound is

$$W_{\min} = \frac{3N^2 + 4N}{8}, \qquad (11)$$

and is achieved when $n_0 = N/2$ where the last N/2 $a_j$'s are zero.
When N is odd, the maximum and minimum values for $W_i$ are

$$W_{\max} = \frac{(N+1)(5N-1)}{8}, \qquad (12)$$

$$W_{\min} = \frac{(N+1)(3N+1)}{8}. \qquad (13)$$

The maximum value occurs when $n_0$ = (N + 1)/2 or (N - 1)/2
where the first (N + 1)/2 or (N - 1)/2 $a_j$'s are zero. Analogously
the minimum value occurs when $n_0$ = (N + 1)/2 or (N - 1)/2
where the last (N + 1)/2 or (N - 1)/2 $a_j$'s are zero. Thus, $W_i$ is a
maximum when discrimination is perfect and difficulty is .5.

$W_i$ is still not satisfactory as an index of an item's effectiveness,
however, because repeated administrations of a given item would
not yield comparable $W_i$ values unless N is constant over all administrations.
This dependence on N can be eliminated if we subtract
from $W_i$ the average value $\bar{W}$ and divide the difference by the
maximum $W_i$ value minus $\bar{W}$. This new value, denoted $S_i$, equals
the "status" of the items or symbolically

$$S_i = \frac{W_i - \bar{W}}{W_{\max} - \bar{W}}. \qquad (14)$$

For computational purposes, the above expression for $S_i$ can be
simplified by referring to equations 6, 9, 10, and 12. When N is
even

$$S_i = \frac{8Y_i + 4(N+1)(n_0 - N)}{N^2}, \qquad (15)$$

and when N is odd

$$S_i = \frac{8Y_i + 4(N+1)(n_0 - N)}{N^2 - 1}. \qquad (16)$$

The mean of the distribution of $S_i$ values, for each N, can be
found by summing equation 14 over all possible response configurations
and dividing by $2^N$ to give

$$\bar{S} = \frac{1}{2^N}\sum S_i = \frac{1}{2^N (W_{\max} - \bar{W})} \sum (W_i - \bar{W}).$$

Substituting equations 8 and 9 in the above expression and simplifying
yields

$$\bar{S} = \frac{1}{2^N (W_{\max} - \bar{W})}\left[2N(N+1)\,2^{N-2} - \frac{N(N+1)}{2}\,2^N\right] = 0.$$

Referring to the definition of $Y_i$, equation 15 can be written

$$S_i = \frac{4}{N^2}\left(2\sum_{j=1}^{N} j\,a_j - n_1(N+1)\right) \qquad (17)$$

and equation 16 can be written

$$S_i = \frac{4}{N^2 - 1}\left(2\sum_{j=1}^{N} j\,a_j - n_1(N+1)\right). \qquad (18)$$
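The computational forms of the status index for even and odd N can be sketched in Python as below. The function name and the short example vectors are illustrative assumptions; the exhaustive enumeration is only there to confirm the stated range for small N.

```python
from itertools import product


def status_index(responses):
    """S for a 0/1 response vector ordered from lowest- to highest-ranked examinee."""
    n = len(responses)
    y = sum(rank * a for rank, a in enumerate(responses, start=1))  # rank sum of passers
    n1 = sum(responses)                                             # number who passed
    denominator = n * n if n % 2 == 0 else n * n - 1                # N even vs N odd
    return 4 * (2 * y - n1 * (n + 1)) / denominator


# Hypothetical examples.
s_perfect_even = status_index([0, 0, 0, 1, 1, 1])
s_reversed_even = status_index([1, 1, 1, 0, 0, 0])
s_everyone_passes = status_index([1, 1, 1, 1, 1, 1])
s_perfect_odd = status_index([0, 0, 1, 1, 1])

# Range check by brute force for N = 5 and N = 6.
ranges = {n: (min(status_index(v) for v in product((0, 1), repeat=n)),
              max(status_index(v) for v in product((0, 1), repeat=n)))
          for n in (5, 6)}
```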


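The variance of the distribution of $S_i$ values, for N even, is equal to

$$\sigma_S^2 = \frac{16}{N^4\, 2^N} \sum \left(2\sum_{j=1}^{N} j\,a_j - n_1(N+1)\right)^2 \qquad (19)$$

and for N odd, is equal to

$$\sigma_S^2 = \frac{16}{(N^2-1)^2\, 2^N} \sum \left(2\sum_{j=1}^{N} j\,a_j - n_1(N+1)\right)^2. \qquad (20)$$

In order to obtain usable formulas for the variance of the distribution
of $S_i$ values it is necessary to simplify the expression

$$\sum \left(2\sum_{j=1}^{N} j\,a_j - n_1(N+1)\right)^2, \qquad (21)$$

which is common to both equations 19 and 20. It can be shown²
that expression 21 is equal to

$$\frac{N(N+1)(N-1)}{3}\, 2^{N-2}. \qquad (22)$$

Substituting this value in equations 19 and 20 yields

$$\sigma_S^2 = \frac{4(N^2-1)}{3N^3}, \quad \text{for } N \text{ even;} \qquad (23)$$

and

$$\sigma_S^2 = \frac{4N}{3(N^2-1)}, \quad \text{for } N \text{ odd.} \qquad (24)$$
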

² The derivation of equation 22 from equation 21 has been deposited with
the National Auxiliary Publications Service, c/o CCM Information Corporation,
909 Third Avenue, New York, N. Y. 10022. Copies of this material may
be obtained by citing document #NAPS-01602 and remitting $5.00 for photocopies
and $2.00 for microfiche.

We have shown that the distribution of $S_i$, for each N, has a
known variance and is symmetrical about zero with maximum and
minimum values of one and minus one respectively. For each $n_0$
and $n_1$, $S_i$ is a linear transformation of Wilcoxon's rank sum statistic
T (Wilcoxon, 1945). Summing over $n_0$ and $n_1$ we see that the
distribution is linearly related to the sum of the various T distributions,
holding N constant. In that the asymptotic distribution
of T is normal (Bradley, 1960, p. 136), the asymptotic distribution
of $S_i$ is also normal. Thus the probability of a given $S_i$ value differing
from zero can be estimated by calculating $S_i/\sigma_S$ and comparing
the value to tables of the normal distribution for larger N.
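The closed-form variances and the resulting normal-approximation statistic can be sketched in Python as follows. The helper names and the 20-person response vector are hypothetical; the brute-force enumeration is included only as a check of the closed forms for small N.

```python
from itertools import product
from math import sqrt


def status_index(responses):
    """S for a 0/1 response vector ordered from lowest- to highest-ranked examinee."""
    n = len(responses)
    y = sum(rank * a for rank, a in enumerate(responses, start=1))
    n1 = sum(responses)
    denominator = n * n if n % 2 == 0 else n * n - 1
    return 4 * (2 * y - n1 * (n + 1)) / denominator


def status_variance(n):
    """Closed-form variance of S when all 2**n response vectors are equally likely."""
    if n % 2 == 0:
        return 4 * (n * n - 1) / (3 * n ** 3)
    return 4 * n / (3 * (n * n - 1))


def enumerated_variance(n):
    """Variance of S by enumeration; the mean of S is zero, so it is the mean square."""
    values = [status_index(v) for v in product((0, 1), repeat=n)]
    return sum(s * s for s in values) / len(values)


def z_statistic(responses):
    """S divided by its standard deviation, for comparison with normal tables."""
    return status_index(responses) / sqrt(status_variance(len(responses)))


# Hypothetical 20-person item: mostly failed by low scorers, passed by high scorers.
item = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
z = z_statistic(item)
```

As the text notes, the normal approximation is intended for larger N, so the statistic should be read cautiously for very short response vectors.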

The difference between $S_i$ and the rank-biserial index, rb, reported
by Glass (1965, 1966) is due to the effect of item difficulty on $S_i$.
The two indices will yield identical values for N even, whenever
$n_0$ equals $n_1$, and for N odd, whenever $n_0$ equals $n_1$ + 1. Two advantages
of $S_i$ over rb are: (1) $S_i$ equals zero whenever $n_0$ equals
zero or N, while rb is undefined in these two situations; and, (2) $S_i$
attains its maximum value of one from only one possible response
configuration (two if N is odd), while rb attains a value of one
under N - 1 possible response configurations.

In summary, $S_i$ is an easily computed nonparametric index that:

1. is dependent on item difficulty and discrimination;

2. has a known range and variance;

3. has a significance test for its difference from zero;

4. can be meaningfully compared across different administrations
of the same items; and,

5. can be computed by using either the total score of the test in
which the item is contained or an outside criterion.


REFERENCES 


Bradley, J. V. Distribution-free statistical tests. WADD Technical
Report 60-661, Wright Air Development Division, Wright-Patterson
Air Force Base, 1960.

Cronbach, L. J. and Warrington, W. G. Efficiency of multiple-choice
tests as a function of spread of item difficulties. Psychometrika,
1952, 17, 127-147.

Cureton, E. E. Rank-biserial correlation. Psychometrika, 1956, 21,

Davis, F. B. Item selection techniques. In E. F. Lindquist (Ed.),
Educational measurement. Washington, D.C.: American Council
on Education, 1951, Pp. 266-328.

Findley, W. G. A rationale for evaluation of item discrimination
statistics. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
1956, 16, 175-180.

Glass, G. V. A rankin

Glass, G. V. Note on rank biserial correlation. EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, 1966, 26, 623-631.

Gulliksen, H. The relation of item difficulty and interitem correlation
to test variance and reliability. Psychometrika, 1945, 10,
79-91.

Gulliksen, H. Theory of mental tests. New York: John Wiley &
Sons, 1950.

Lord, F. M. A theory of test scores. Psychometric Monograph, 1952,
No. 7.

Wilcoxon, F. Individual comparisons by ranking methods. Biometrics
Bulletin, 1945, 1, 80-83.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1971, 31, 851-854.


BAYESIAN TECHNIQUES FOR TEST SELECTION 


W. PAUL JONES 
New Mexico State University, Alamagordo 
F. L. NEWMAN 
University of Miami, Coral Gables 


In the recent literature related to statistical inference, an increasing
amount of attention has been given to the potential of the
techniques included under the rubric of Bayesian statistics (Meyer, 
1966). While the basic Bayesian theorem dates back to 1763, an 
interest in applications of the theorem has come into prominence 
only in the past few years. A perusal of this Bayesian literature 
indicates that while there can be little question as to the validity 
of the theorem, its applications have generated a great deal of 
controversy (Binder, 1964). 

The purpose of this article is not to present a detailed analysis 
of the rationale behind Bayesian inferential techniques. Several 
references are available which provide such information (Meyer, 
1966; Binder, 1964; Edwards, Lindman, and Savage, 1963), and 
the reader desiring to explore in depth is referred to these sources. 
Rather, this article will present an example of Bayesian techniques 
for a specific problem, that of test selection. Horst (1966) indicated 
a need for new applications of decision theory to the area of 
Psychological measurement and prediction, and this article is ad- 
dressed to this need. The format of the article will be to present a 
hypothetical example of the application of Bayesian procedures in 
test selection and to discuss some of the ramifications of the method 
as applied to the example. It has been said that modern emphasis
in Bayesian inference began with the publication in 1959 of Schlaifer's
Probability and Statistics for Business Decisions, and the procedures
in the example were adapted from procedures suggested by
Schlaifer in a later work also oriented to business applications
(1961).

The two primary advantages of Bayesian techniques are: (a)
prior information about the problem under investigation is utilized
in the decision making process, and (b) a specific probability statement
can be applied to the results of the calculations. One of the
major criticisms leveled against Bayesian procedures is based on
this first advantage. Some critics insist that the utilization of prior
information somehow lessens the objectivity of the decision making
process. However, Meyer (1964) has noted that all decision making
procedures, including hypothesis testing, contain some subjective
underpinnings and added that someone with no previous knowledge
about a problem is hardly equipped to investigate the problem. In
terms of the second advantage, while students are prone to interpret
all confidence intervals as if they were probability statements regarding
a point estimator, such interpretations have no basis in the
rationale of classical procedures.


Hypothetical Example 
The Problem : 


A counselor is faced with a decision regarding the use of a short 
aptitude test as an aid in placement of students in remedial, 
regular, or honors classes. The test will be used only if it has cor- 
relation equal to or greater than .50 with the criterion (academic 
grades). From past experience with aptitude tests in other scholastic 
circumstances, the counselor's data suggest that the correlation 
between test and performance will be .60, and furthermore, that the 
interval between .50 and .70 would approximately constitute a plus 
or minus one sigma interval for the coefficient. To make the assump- 
tion of a normal prior distribution more tenable, all coefficients 
are transformed into Fisher's z functions. Thus, the prior distribution
has a mean of .69 (r = .60, z = .69) and a sigma of .16
([.87 - .55]/2 = .16). The criterion, the z transform of .50, is .55.

The counselor then takes a sample of 50 randomly selected students
and finds a correlation between aptitude test and performance of .71
(expressed as a z transform) with a standard error of √(1/(50 - 3)) =
.15. The question then is whether a classical inferential process
serves the counselor better than a Bayesian inferential process
in aiding the counselor's decision as to whether he should use the
test.
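The figures used to set up the example can be reproduced with the Fisher z transformation, which is the inverse hyperbolic tangent of r. The short Python sketch below is illustrative; the variable names are assumptions introduced here.

```python
from math import atanh, sqrt

# Fisher's z transform of a correlation r is atanh(r).
prior_mean_z = atanh(0.60)                      # prior mean on the z scale
prior_sd_z = (atanh(0.70) - atanh(0.50)) / 2    # half the width of the one-sigma interval
criterion_z = atanh(0.50)                       # criterion correlation on the z scale
sample_se = sqrt(1 / (50 - 3))                  # standard error of z for a sample of 50
```

Rounded to two decimals these give .69, .16, .55 and .15, the values used in the example.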
" 


The Classical Approach 


Here the counselor simply contrasts the two z values (.71 and
.55) to test the null hypothesis that there is "no difference" with P
(Type I error) = .05:

$$z = \frac{.71 - .55}{.15} = 1.07.$$

The conclusion would be: "Do not reject the null hypothesis." Further,
by computing a confidence interval and transforming the z
functions back to correlation coefficients, the counselor may expand
his nonrejection statement to say that the interval .40 to .76 has
a .95 probability of covering the true predictive value. This is,
however, the limit of inference allowed with this problem.
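The classical calculation can be sketched in Python as follows. The two-sided critical value of 1.96 is an assumption made here (the article states only a Type I error of .05), and the variable names are illustrative.

```python
from math import tanh

z_sample, z_criterion, se = 0.71, 0.55, 0.15

# Test statistic contrasting the sample z with the criterion z.
z_stat = (z_sample - z_criterion) / se

# ASSUMPTION: two-sided test at the .05 level, critical value 1.96.
reject_null = abs(z_stat) > 1.96

# 95 per cent interval on the z scale, transformed back to correlations with tanh.
lower_r = tanh(z_sample - 1.96 * se)
upper_r = tanh(z_sample + 1.96 * se)
```

With these rounded inputs the back-transformed limits come out close to the .40 to .76 interval reported in the text; small differences in the second decimal reflect rounding of the inputs.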


The Bayesian Approach 


There are a number of inferential developments possible under the
heading of Bayesian inference. The one that seems most appropriate
for this problem would be to first establish a posterior distribution
based on the prior distribution and the sample, and to contrast it
with the criterion of .55. From this procedure, a definite probability
statement can be made regarding the criterion and the efficiency of
the aptitude test.

In accordance with the formulation suggested by Schlaifer (1961),
the mean ($M_2$) and standard deviation ($sd_2$) of the posterior distribution
are:

$$M_2 = \frac{M_0 h_0 + M_1 h_1}{h_0 + h_1}, \qquad sd_2 = \sqrt{\frac{1}{h_0 + h_1}},$$

where

$M_0$ = prior mean (.69)
$M_1$ = sample mean (.71)
$h_0$ = reciprocal of prior variance (1/.16²)
$h_1$ = reciprocal of sample variance (1/.15²)

In this example the values are: $M_2$ of 0.70 and $sd_2$ of 0.11. The
posterior distribution and criterion can then be used to establish
an explicit probability statement such as:

$$P\left[z < \frac{.55 - .70}{.11} = -1.36\right] = 0.087.$$


The verbiage of inference here takes a different tack than the classical
procedure. Given the posterior distribution, the probability of
obtaining a value z less than .55 is .087 or less than one in 10. Thus,
the chances are better than nine out of 10 that the new test will have
a predictive value equal to or greater than the criterion. The criterion
value, .55, becomes the lower limit of prediction for the
counselor's probability statement, rather than the pivotal quantity
of an interval. The counselor is not accepting or rejecting the
sample data, but rather incorporating the data in his probability
statement. Now, if he chooses to use the aptitude test, the decision
to do so would be justified by the probability statement, and not
simply because the test is "no different" from the criterion.
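The posterior calculation is a precision-weighted average, and the probability statement follows from the normal distribution function. The Python sketch below illustrates this; the function names are assumptions, and the posterior mean and standard deviation are rounded to two decimals before the tail probability is taken, as the article appears to do.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def posterior(prior_mean, prior_sd, sample_mean, sample_se):
    """Normal prior combined with a normal sample: precision-weighted mean and sd."""
    h_prior = 1 / prior_sd ** 2      # reciprocal of prior variance
    h_sample = 1 / sample_se ** 2    # reciprocal of sample variance
    mean = (prior_mean * h_prior + sample_mean * h_sample) / (h_prior + h_sample)
    sd = sqrt(1 / (h_prior + h_sample))
    return mean, sd


post_mean, post_sd = posterior(0.69, 0.16, 0.71, 0.15)

# Probability that the true z falls below the criterion of .55,
# using the posterior mean and sd rounded to two decimals.
p_below = normal_cdf((0.55 - round(post_mean, 2)) / round(post_sd, 2))
```

The result is a little under .09, in line with the .087 quoted above; carrying unrounded values through gives a slightly smaller probability.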

According to a Bayesian view of the world, the sample data 
should now be incorporated into the counselor’s information for 
further decision making. For example, if the counselor wishes to 
draw another sample of 50 and run the correlation again, he can use 
information from the first sample in evaluating results of future 
samples. Specifically, he can determine the critical value (C.) below 
which the aptitude test’s correlation with performance would be 
suspect. Before collecting data on a second sample, the C, could be 
estimated (Schlaifer, 1961) as follows: 


Thus, given the posterior distribution (based on prior distribution 

and first sample), the counselor should not reject the aptitude test 

unless the second sample correlation is less than a z of .27. 
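The formula for the critical value is not legible in this copy of the article. One reading that reproduces the reported value of .27 is sketched below in Python: it ASSUMES the critical value is the second-sample z at which the updated posterior mean would fall exactly on the criterion. The function name and this interpretation are assumptions, not a statement of Schlaifer's procedure.

```python
def critical_sample_value(criterion, current_mean, current_sd, next_se):
    """ASSUMED rule: the sample z at which the updated posterior mean equals the criterion.

    Solves (current_mean * h_current + c * h_next) / (h_current + h_next) = criterion for c.
    """
    h_current = 1 / current_sd ** 2   # precision of the current posterior
    h_next = 1 / next_se ** 2         # precision of the planned second sample
    return (criterion * (h_current + h_next) - current_mean * h_current) / h_next


# Posterior from the first sample (mean .70, sd .11) and a second sample of 50 (se .15).
critical_z = critical_sample_value(0.55, 0.70, 0.11, 0.15)
```

Under this assumption a second-sample z below about .27 would pull the posterior mean below the criterion of .55.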
Discussion 


In order to use the procedures above, the only information necessary
that would not typically be used is the prior distribution of the
variable. The other data will be available to the counselor, but he
certainly should also have some previous data or prior belief about
the relationship between aptitude tests and performance criteria.
Only in Bayesian procedures can this previous knowledge be
utilized in the statistical analysis.

The decision to use Fisher's z functions, while necessitating some
adjustments in the prior distribution, was necessary to justify the
assumption of a normal prior distribution. While Bayesian techniques
are certainly not limited to normal prior distributions, the
calculations involved are simplified when a normal distribution can
be assumed. Schlaifer (1961) has indicated that if the variance of
the prior distribution is large compared with the variance of the 
sample, the mean and variance of the prior distribution can be 
substituted into the formulae which apply to normal prior dis- 
tributions with no material loss in accuracy. This is evident 
in a perusal of the formula for determining the posterior mean 
which indicates that the posterior mean is a weighted average of 
the prior mean and the sample mean, and that the mean with the 
smallest variance receives the largest weight. 
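
The weighted-average property described above can be illustrated with a minimal sketch. It assumes a normal prior and a normal sampling distribution on Fisher's z scale, with sampling variance 1/(n - 3); the prior mean and variance and the sample correlation used here are hypothetical values chosen only for illustration, not figures from the counselor example.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def posterior_normal(prior_mean, prior_var, sample_mean, sample_var):
    """Combine a normal prior with a normal sample estimate.

    Each mean is weighted by its precision (1 / variance), so the
    mean with the smaller variance receives the larger weight.
    """
    w_prior = 1.0 / prior_var
    w_sample = 1.0 / sample_var
    post_var = 1.0 / (w_prior + w_sample)
    post_mean = post_var * (w_prior * prior_mean + w_sample * sample_mean)
    return post_mean, post_var


# Hypothetical inputs (assumptions for illustration only).
n = 50                                    # sample size
prior_mean, prior_var = fisher_z(0.40), 0.04
sample_mean, sample_var = fisher_z(0.30), 1.0 / (n - 3)

post_mean, post_var = posterior_normal(
    prior_mean, prior_var, sample_mean, sample_var
)
# Report the posterior mean back on the correlation scale.
print(round(math.tanh(post_mean), 3), round(post_var, 4))
```

Because the sample variance in this sketch is smaller than the prior variance, the posterior mean lies closer to the sample value than to the prior value.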

Therefore, with a minimal amount of computation, the counselor 
is able to synthesize his previous knowledge of the problem with 
data from the sample and determine a specific probability to attach 
to his decision. This alone would seem to justify the procedures, but 
as noted in the example, he can further use both his previous knowl- 
edge and results from the first sample in decisions regarding the 
results of still other samples. 

Other applications of these procedures could include comparisons 
of different test instruments, an analysis of the value of adding 
more predictors, and various other problems of importance to the 
psychometrician. It is not suggested that Bayesian techniques will 
or should replace other inferential procedures. However, the tech- 
niques discussed in this article should be a useful addition to the 
statistical repertoire of personnel given responsibility for decisions 
regarding the use of various test instruments. 

PROBLEMS WITH INFERRING TREATMENT EFFECTS 
FROM REPEATED MEASURES 


CHARLES E. WERTS AND ROBERT L. LINN 
Educational Testing Service 


A commonly encountered research design is that in which treat- 
ments are randomly assigned to available preformed groups and 
the differential effectiveness of the treatments is evaluated from 
measurements on the same instrument at the beginning and at the 
end of the experiment, it being known that the effect is appropriately 
indicated by changes in the group means. True experimental design 
in such a situation would require that more than one group be as- 
signed at random to each treatment. The group could then be used as 
the unit of analysis and the differential treatment effects could 
be tested using the variation among group means within treatments. 

In practice, however, there is frequently only one group per treat- 
ment and the individuals are used as the unit of analysis. Since 
the treatments are randomly assigned, either the analysis of co- 
variance using initial status as the covariate (this would be “usage 
2" mentioned by Evans and Anastasio, 1968) or a two factor 
(groups and time) analysis of variance for repeated measures 
(Winer, 1962) would appear relevant to the problem. In this paper 
the differences in interpretation that arise from using these two 
quasi-experimental procedures on the same data will be examined. 
While neither of these methods can be considered proper experi- 
mental design, a comparison of the logic of the two procedures 
provides a better understanding of the logic of some quasi-experi- 
mental procedures. 


Methodology 


Because our purpose is primarily to understand the logic behind 
the two methods, it shall be assumed hereafter that all measures 
are infallible and that random sampling procedures are perfect so 
that the treatment-covariate correlation is zero. It is also assumed 
that any extraneous variables that influence growth are unrelated 
to the influences being studied and that all relationships are linear 
and additive. 


A. The Analysis of Covariance (ANCOVA) 

The mathematical model for ANCOVA is: 

$$Y_{ij} = A_j + B_w X_{ij} + e_{ij} \qquad (1)$$

where 

$Y_{ij}$ = final status, 
$X_{ij}$ = initial status, 
$B_w$ = pooled within groups regression slope, 
$A_j$ = intercept of the Y on X regression line for group j, i.e., 
$A_j = \bar{Y}_j - B_w \bar{X}_j$, where $\bar{Y}_j$ and $\bar{X}_j$ are the respective Y 
and X means for group j, and 
$e_{ij}$ = error term assumed independent of $A_j$ and $X_{ij}$ and with 
zero mean. 

ANCOVA requires homogeneity of within groups regression slopes, 
which if not found would mean that the treatments could not be 
simply ordered on a scale of effectiveness. Given this assumption, 
the special ANCOVA case in which the final status $Y_{ij}$ is the initial 
status $X_{ij}$ plus a growth component $G_{ij}$ ($\bar{G}_j$ is the mean growth 
for group j) may be delineated. It follows from equation (1) that 
$Y_{ij} = X_{ij} + G_{ij} = A_j + B_w X_{ij} + e_{ij}$, $A_j = \bar{Y}_j - B_w \bar{X}_j$, 
and $\bar{Y}_j = \bar{X}_j + \bar{G}_j$. 

Solution of the normal equations for $B_w$ yields $B_w = 1 + B_{GX.w}$, 
where $B_{GX.w}$ is the within groups regression weight of growth on 
initial status and the one arises from the fact that part of final status 
is initial status. In this model, a nontrivial $B_{GX.w}$ indicates that the 
rate of growth is influenced by the level of the initial status. Sub- 
stitution of $B_w$ into the equation for the intercepts yields 

$$A_j = \bar{Y}_j - B_w \bar{X}_j = \bar{X}_j + \bar{G}_j - (1 + B_{GX.w}) \bar{X}_j = \bar{G}_j - B_{GX.w} \bar{X}_j.$$

In other words, the intercept represents the mean growth for a group 
($\bar{G}_j$) corrected for the net influence of X on growth, i.e., $A_j$ is a 
measure of the net influence of treatments on growth with $X_{ij}$ 
controlled. 
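
The identity $B_w = 1 + B_{GX.w}$ and the resulting form of the intercepts can be checked numerically. The following is a minimal sketch on simulated data; the group sizes, treatment effects, growth coefficient, and error variance are all hypothetical assumptions, and the helper name `pooled_within_slope` is introduced here only for illustration.

```python
import numpy as np


def pooled_within_slope(x, y, groups):
    """Pooled within-groups regression slope of y on x."""
    sxy = 0.0
    sxx = 0.0
    for g in np.unique(groups):
        mask = groups == g
        dx = x[mask] - x[mask].mean()
        dy = y[mask] - y[mask].mean()
        sxy += float(dx @ dy)
        sxx += float(dx @ dx)
    return sxy / sxx


rng = np.random.default_rng(0)
groups = np.repeat(np.arange(3), 200)          # three groups of 200
x = rng.normal(50.0, 10.0, size=groups.size)   # initial status

# Hypothetical growth: treatment effect + 0.2 * initial status + noise.
treatment = np.array([2.0, 5.0, 8.0])[groups]
growth = treatment + 0.2 * x + rng.normal(0.0, 3.0, size=groups.size)
y = x + growth                                  # final status

b_w = pooled_within_slope(x, y, groups)         # slope of final on initial
b_gx_w = pooled_within_slope(x, growth, groups) # slope of growth on initial

# Intercepts computed two ways: from final status and from growth.
a_from_y = np.array(
    [y[groups == j].mean() - b_w * x[groups == j].mean() for j in range(3)]
)
a_from_g = np.array(
    [growth[groups == j].mean() - b_gx_w * x[groups == j].mean() for j in range(3)]
)
print(b_w, 1.0 + b_gx_w)
print(a_from_y, a_from_g)
```

The two slopes differ by exactly one and the two sets of intercepts coincide, because final status is constructed as initial status plus growth.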


In summary, when initial status is the covariate in the analysis of 
covariance, the "treatment effects" represent the net influence of 
groups on growth with initial status controlled. To the degree that the 
within groups slope differs from unity it may be inferred in this 
model that initial status also influences the rate of growth for 
individuals. 

B. The Repeated Measures Analysis of Variance (Winer, 1962) 

The analysis of variance (ANOVA) design relevant to our exam- 
ple can be represented schematically as: 

                               Time 
                         Initial     Final 
    Treatments  j = 1    Group 1     Group 1 
                j = 2    Group 2     Group 2 
                j = 3    Group 3     Group 3 

In this analysis the “main effects” of treatments are completely 
confounded with differences between groups; however, the “main 
effects” of time as well as the treatment by time interaction are 
free of such confounding. Winer (1962) states: “The primary pur- 
pose of repeated measures on the same elements is the control that 
this kind of design provides over individual differences between 
experimental units. In the area of the behavioral sciences, differ- 
ences between such units often are quite large relative to differences 
in treatments which the experimenter is trying to evaluate.” De- 
pending on the particular effect being studied the experimenter may 
or may not be interested in the time main effect, i.e., whether across 
all treatments there is on the average a net change in status; e.g., 
if the effect of diet were being studied on groups of children, the 
average gain in weight over time would be confounded with matura- 
tional trends and therefore of little interest. The differential impact 
of various treatments is indicated by the treatment by time inter- 
action which is a measure of whether the mean changes differ among 
the treatments. Differential treatment effects in ANCOVA corre- 
spond to variation among the intercepts, i.e., the $A_j$ in equation (1). 


Comparison of the Methods 
A case in which three treatments have the same mean gain is 
depicted in Figure 1. 
The repeated measures ANOVA would indicate no treatment by 
time interaction since the difference between initial and final status 
is the same for all groups. The condition under which ANCOVA 
would also indicate no differential treatment effects can be derived 
as follows: 
Since $A_j = \bar{Y}_j - B_w \bar{X}_j$, 

$$\sigma_A^2 = \sigma_{\bar{Y}}^2 + B_w^2 \sigma_{\bar{X}}^2 - 2 B_w \sigma_{\bar{X}\bar{Y}},$$

where $\sigma_A^2$, $\sigma_{\bar{Y}}^2$, and $\sigma_{\bar{X}}^2$ are the variances of the $A_j$, $\bar{Y}_j$, and 
$\bar{X}_j$ when these are assigned to individuals and 
$\sigma_{\bar{X}\bar{Y}}$ = covariance of $\bar{X}_j$ and $\bar{Y}_j$ when assigned to individuals. 
No differential effect implies that $\sigma_A^2 = 0$ and for the example in 
Figure 1 $\sigma_{\bar{Y}}^2 = \sigma_{\bar{X}}^2 = \sigma_{\bar{X}\bar{Y}}$, therefore: 

$$\sigma_{\bar{Y}}^2 + B_w^2 \sigma_{\bar{X}}^2 - 2 B_w \sigma_{\bar{X}\bar{Y}} = 0,$$

or $B_w = 1$. 

If $B_w = 1$ then $B_{GX.w} = 0$, i.e., initial status does not influence the 
rate of growth. In general, the difference between the ANOVA and 
ANCOVA models for simulating the action of a process is that the 
ANOVA assumes that initial status does not influence the rate of 
an individual's growth whereas the ANCOVA design allows this 
question to be settled by the within groups regression slope, i.e., by 
whether $B_w$ differs from unity. This difference can be seen more 
clearly from Figure 2 which depicts the so-called "fanspread" 
hypothesis. 

[Figure 1. The case of equal gain. Horizontal axis: Time (Test, Retest).] 

In this case the school means on retest correlate perfectly with 
the initial means but the former are more spread out. Because the 
differences between initial and final status are not the same for all 
groups the repeated measures ANOVA will indicate a treatment-by- 
time interaction. In contrast ANCOVA will not yield a differential 
treatment effect. 

The within groups slope (which because of random assignment 
equals the between groups slope) for no differential treatment effect 
in ANCOVA can be derived as follows: 


In the given example the correlation of $\bar{X}_j$ and $\bar{Y}_j$ is perfect, i.e., 

$$r_{\bar{X}\bar{Y}} = \frac{\sigma_{\bar{X}\bar{Y}}}{\sigma_{\bar{X}} \sigma_{\bar{Y}}} = 1.$$

Substituting into $\sigma_A^2 = \sigma_{\bar{Y}}^2 + B_w^2 \sigma_{\bar{X}}^2 - 2 B_w \sigma_{\bar{X}\bar{Y}}$, 
for $\sigma_A^2 = 0$ we obtain: 

$$B_w = \frac{\sigma_{\bar{Y}}}{\sigma_{\bar{X}}}.$$

In other words, if the within groups slope equals the (weighted) 
standard deviation of the final means divided by that of the initial 
group means then ANCOVA will indicate that the spreading apart 
of the school means is due to the influence of initial status on rate 
of gain rather than to a differential school effect. These results follow 
from the specification that the correlation between $\bar{X}_j$ and $\bar{Y}_j$ is 
perfect. If the correlation between $\bar{X}_j$ and $\bar{Y}_j$ is less than one then 
$\sigma_A^2$ will always be greater than zero. 

[Figure 2. The fanspread hypothesis. Treatment means (#1, #2, #3) plotted against Time (Test, Retest).] 
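
The two conditions derived for Figures 1 and 2 can be verified with a small numerical sketch. The group means below are hypothetical values chosen only to reproduce the equal-gain and fanspread patterns, equal group sizes are assumed so that unweighted variances can be used, and `intercept_variance` is a helper name introduced here for illustration.

```python
import numpy as np


def intercept_variance(xbar, ybar, b_w):
    """Variance across groups of A_j = ybar_j - b_w * xbar_j.

    Equal group sizes are assumed, so no weighting is applied.
    """
    a = np.asarray(ybar, dtype=float) - b_w * np.asarray(xbar, dtype=float)
    return float(np.var(a))


# Hypothetical initial group means.
xbar = np.array([40.0, 50.0, 60.0])

# Figure 1 pattern: every group gains the same amount.
equal_gain = xbar + 5.0

# Figure 2 pattern: final means correlate perfectly with initial
# means but are more spread out.
fanspread = 1.5 * xbar - 20.0

# Equal gain: the intercept variance vanishes only when b_w = 1.
print(intercept_variance(xbar, equal_gain, 1.0))
print(intercept_variance(xbar, equal_gain, 0.8))

# Fanspread: it vanishes when b_w = sd(final means) / sd(initial means).
ratio = float(np.std(fanspread) / np.std(xbar))
print(ratio, intercept_variance(xbar, fanspread, ratio))
print(intercept_variance(xbar, fanspread, 1.0))
```

In the fanspread case a slope of unity leaves a nonzero intercept variance, which is why the change-score analysis and ANCOVA disagree for such data.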

Our analysis indicates that in a linear model the repeated meas- 
ures ANOVA is an appropriate design for investigating differential 
treatment effects if initial status does not influence the rate of gain 
whereas ANCOVA would be appropriate if it does. The real prob- 
lem is that in many studies the experimenter does not know 
a priori which is really the case. If the truth is unknown, then there 
is no way of knowing which method to believe. One possible ap- 
proach before deciding which method to use is to examine the 
within groups slope to see if it differs meaningfully from unity. In 
real data, problems of unreliability, failure to include other in- 
dividual characteristics which may influence the rate of change, 
failure of the mathematical model to simulate the true phenom- 
ena, and scaling problems will cloud the issue. 


Application to Lord's Paradox 


Illustrating the problems of adequately controlling for pre-exist- 
ing group differences with an example in which two groups did not 
change in mean or variation, Lord (1967) pointed out that a 
statistician interpreting results from the analysis of covariance 
would conclude that there was a treatment effect whereas another 
statistician observing the lack of change in the means could rea- 
sonably conclude there was no treatment effect. It can be seen that 
Lord’s example is a special case of the problem shown in Figure 1, 
the difference being that in his example no mean growth was ob- 
served. The statistician comparing the means could logically have 
applied the repeated measures ANOVA. It follows that Lord's 
statisticians implicitly had differing assumptions about the nature 
of reality which led to their contradictory interpretations. More 
precisely the ANCOVA user was in essence asserting that initial 
status does influence the rate of change in weight and that the 
deviation of the within groups slope from unity was an appropriate 
measure of this influence. In contrast the statistician who compared 
means was essentially asserting that initial status does not influ- 
ence change irrespective of the fact that the within group slope was 
less than unity. These differing theories led to the opposite conclu- 
sions and there is no way of judging which was right since the 
truth is unknown. Too often researchers have forgotten that the 
inferences drawn from statistical analyses can be no more valid 
than the degree to which the mathematical model simulates reality. 
Missing from many studies is an examination of the model under- 
lying the statistical measures in light of what is known or postu- 
lated about the phenomena under study. 


Application to Campbell's Quasi-Experimental Approach 

Given pretest and posttest data, Campbell and Clayton (1961) 
note several symptoms of treatment effects: (a) increases in differ- 
ences between group means, (b) the posttest variance pooled across 
all groups is greater than the pooled pretest variance due to an in- 
crease in between school variance, and (c) the association of treat- 
ments with posttest is greater than that of the treatments with the 
pretest. This approach can be stated in terms of ANCOVA as follows: 


1. The association of treatments with either test is "eta," the 
ratio of the standard deviation of the means (weighted by group 
size) divided by the total standard deviation. When $X_{ij}$ is the pre- 
test and $Y_{ij}$ the posttest scores as before, then condition (c) above 
can be stated algebraically as: 

$$\frac{\sigma_{\bar{X}}^2}{\sigma_X^2} < \frac{\sigma_{\bar{Y}}^2}{\sigma_Y^2}$$


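2. For the case in which treatments are independent of the co- 
variate (i.e., $\sigma_{AX} = \sigma_{A\bar{X}} = 0$) the ANCOVA formulas 
$\bar{Y}_j = A_j + B_w \bar{X}_j$ and $Y_{ij} = A_j + B_w X_{ij} + e_{ij}$ yield: 

$$\sigma_{\bar{Y}}^2 = \sigma_A^2 + B_w^2 \sigma_{\bar{X}}^2 \quad \text{and} \quad \sigma_Y^2 = \sigma_A^2 + B_w^2 \sigma_X^2 + \sigma_e^2.$$

3. By substitution: 

$$\frac{\sigma_{\bar{X}}^2}{\sigma_X^2} < \frac{\sigma_A^2 + B_w^2 \sigma_{\bar{X}}^2}{\sigma_A^2 + B_w^2 \sigma_X^2 + \sigma_e^2}$$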

4. Examination of this formula shows that if there is no differential 
treatment effect (i.e., $\sigma_A^2 = 0$) then $(B_w^2 \sigma_{\bar{X}}^2) \div (B_w^2 \sigma_X^2 + \sigma_e^2)$ 
will always be smaller than $\sigma_{\bar{X}}^2 \div \sigma_X^2$. When the inequality is satisfied 
then $\sigma_A^2 > 0$. From the perspective of ANCOVA this criterion is 
quite stringent since it can easily be shown that $\sigma_A^2$ can be nonzero 
when the inequality is not satisfied. When $\sigma_e^2$ is large then $\sigma_A^2$ 
must be correspondingly large to be detected using the inequality 
criterion. 

5. When Campbell's paradigm is applied to cases where treatments 
are not assigned randomly as in naturalistic school effects studies, 
then the formulas become more complicated since 
$\sigma_{\bar{Y}}^2 = \sigma_A^2 + B_w^2 \sigma_{\bar{X}}^2 + 2 B_w \sigma_{A\bar{X}}$ and 
$\sigma_Y^2 = \sigma_A^2 + B_w^2 \sigma_X^2 + 2 B_w \sigma_{AX} + \sigma_e^2$, where 
$\sigma_{AX} = \sigma_{A\bar{X}}$. It is still true that $\sigma_A^2 > 0$ when the inequality is sat- 
isfied but $\sigma_A^2 > 0$ may be true even when the inequality is not 
satisfied, especially in some of the cases when $B_w$ or $\sigma_{AX}$ are negative. 
6. The researcher who uses the Campbell-Clayton approach has, 
like Lord's statistician who used ANCOVA, assumed that initial 
status can influence growth. Furthermore, this approach is unsatis- 
factory for detecting influences which tend to reduce differences be- 
tween groups such as compensatory education efforts or social pres- 
sures towards societal norms. 
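
The stringency of the inequality criterion described in steps 1 through 4 can be seen in a short sketch for the case in which treatments are independent of the covariate. All variance components below are hypothetical values chosen for illustration, and the function names are introduced here rather than taken from Campbell and Clayton.

```python
def eta_squared_pre(var_xbar, var_x):
    """Squared eta for the pretest: between-groups over total variance."""
    return var_xbar / var_x


def eta_squared_post(var_a, b_w, var_xbar, var_x, var_e):
    """Squared eta for the posttest when treatments are independent
    of the covariate (step 2 of the argument above)."""
    between = var_a + b_w ** 2 * var_xbar
    total = var_a + b_w ** 2 * var_x + var_e
    return between / total


def criterion_met(var_a, b_w, var_xbar, var_x, var_e):
    """True when the posttest association exceeds the pretest one."""
    post = eta_squared_post(var_a, b_w, var_xbar, var_x, var_e)
    return post > eta_squared_pre(var_xbar, var_x)


# Hypothetical variance components (assumptions for illustration).
var_xbar, var_x, b_w = 4.0, 100.0, 1.0

# No differential treatment effect: the criterion is never met.
print(criterion_met(0.0, b_w, var_xbar, var_x, var_e=50.0))

# A small real treatment variance can go undetected.
print(criterion_met(1.0, b_w, var_xbar, var_x, var_e=50.0))

# A larger treatment variance is detected...
print(criterion_met(5.0, b_w, var_xbar, var_x, var_e=50.0))

# ...but the same variance is missed when the error variance is larger.
print(criterion_met(5.0, b_w, var_xbar, var_x, var_e=200.0))
```

The second and fourth cases have a nonzero treatment variance that the inequality fails to flag, which is the sense in which the criterion is stringent.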

The example used in this paper in which treatments were as- 
signed randomly to preformed groups can be classified as a multi- 
group variant of Campbell and Stanley's (1963) design No. 10, 
the nonequivalent control group design. A control group receiving 
no treatment may for statistical purposes be handled as another 
treatment group. The notion that initial status may influence growth 
is an instance of what Campbell and Stanley call the "interaction 
of selection and maturation.” ANCOVA in essence tries to rule 
out one type of selection-maturation interaction by using the within 
groups slope to estimate the effect of differential selection (i.e., 
mean initial status) on maturation. It is known that ANCOVA re- 
sults become difficult to interpret when within group homogeneity 
of regression is not found (which might mean that whether one 
treatment is better than another depends on which subject it is ap- 
plied to). In these cases the notion of a treatment effect applicable 
to everybody in a group is no longer appropriate, yet such a notion 
is probably implicit in most uses of ANCOVA and the Campbell- 
Stanley design No. 10. When homogeneity of within group regression 
is found ANCOVA rules out the selection-maturation hypothesis 
that the change in means is proportional to initial status. If the 
change in means is proportional to initial status the increase in the 
within group variance will be in general proportionately larger than 
the increase in the between groups variance since $\sigma_e^2$ is greater 
than zero. In such cases the treatment-test correlation will de- 
crease from test to retest, i.e., neither the Campbell-Clayton cri- 
teria nor ANCOVA will yield a differential treatment effect. 


Comparison of the ANCOVA to the ANOVA model 


A major difference between the ANCOVA and the ANOVA de- 
signs is that the latter does not interpret the within groups co- 
variance between initial and final status. Therefore, it is consistent 
with the logic of ANOVA to measure only a random sample of the 
members from each group on pretest and a new random sample 
from that group on posttest. Similarly it would be reasonable in 
ANOVA to randomly split the members of each group such that 
half received the pretest and half the posttest, as a means of avoid- 
ing practice effects. Thus Schaie’s (1965) model for studying de- 
velopmental effects shares the logic of the ANOVA model with re- 
spect to the influence of initial status. 

The neglect of the within groups covariance suggests that from 
the viewpoint of the general linear model the test-retest ANOVA 
design may be considered as a special case of the ANCOVA model. 
Consider the dummy variable form of ANCOVA: 


$$Y_{ij} = B_0 Z_0 + B_1 Z_1 + \cdots + B_{J-1} Z_{J-1} + B_w X_{ij} + e_{ij}$$

where 

$Z_0$ = dummy code of 1 for everybody, and 
$Z_j$ = 1 for persons in group j, 0 for others 
(J = total number of groups). 

If $B_w$ were unity then the $X_{ij}$ term could be shifted to the left side 
of the equation yielding: 

$$Y_{ij} - X_{ij} = G_{ij} = B_0 Z_0 + B_1 Z_1 + \cdots + B_{J-1} Z_{J-1} + e_{ij}$$

This equation is simply the analysis of variance of change scores 
which yields the identical differential treatment effect indicated 
by the treatments by time interaction in the repeated measures de- 
sign. Whereas the ANCOVA model is the special case of the linear 
model with homogeneous within group regression, the treatment 
by time interaction of ANOVA analysis of test-retest measures may 
be considered the special case of the repeated measures ANCOVA 
model with the additional requirement that initial status not in- 
fluence growth, i.e., $B_w = 1$. 
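
The reduction of the dummy variable model to an analysis of change scores when the slope is fixed at unity can be sketched as follows. The data are simulated under hypothetical treatment effects, and the dummy coding uses the first group as the reference category, which is an assumption of this sketch rather than a detail given in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
J, n = 3, 50                                   # groups and persons per group
groups = np.repeat(np.arange(J), n)
x = rng.normal(50.0, 10.0, size=J * n)         # initial status

# Hypothetical data in which growth does not depend on initial status.
effects = np.array([2.0, 5.0, 8.0])
y = x + effects[groups] + rng.normal(0.0, 3.0, size=J * n)

# Design matrix: Z_0 = 1 for everybody, plus J - 1 group dummies.
dummies = [(groups == j).astype(float) for j in range(1, J)]
Z = np.column_stack([np.ones(J * n)] + dummies)

# Fixing the slope at unity moves X to the left side of the equation,
# so the change score is regressed on the dummy variables alone.
gain = y - x
coef, *_ = np.linalg.lstsq(Z, gain, rcond=None)

# Group means implied by the regression versus observed mean gains.
fitted_means = np.array([coef[0]] + [coef[0] + c for c in coef[1:]])
observed_means = np.array([gain[groups == j].mean() for j in range(J)])
print(fitted_means)
print(observed_means)
```

The fitted group effects reproduce the observed mean gains, so the test of the dummy coefficients in this constrained model is the test of differences in mean change among treatments.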

ARE THERE TWO EXTREMENESS RESPONSE SETS? 


LEONARD V. GORDON 
State University of New York at Albany 


THE extremeness response set is defined as the individual's tend- 
ency to use the more extreme scale alternatives when responding 
along an intensity dimension, such as one ranging from "Strongly 
Agree" to "Strongly Disagree." Since this response set has been 
assumed to be independent of the scale content itself, some investiga- 
tors have recommended its elimination through the use of an ap- 
propriate scoring scheme, while others have taken it to be a worth- 
while subject of study in its own right. 

Hamilton (1968) in a comprehensive review of the literature 
relevant to the extremeness response set concluded that it can be 
reliably measured but described the evidence regarding its correlates 
as contradictory or at least unclear. He ascribed this lack of agree- 
ment largely to the diversity of scale content and the different 
techniques for extremeness response set measurement employed in 
the studies reviewed. Another possibility, apparently not consid- 
ered by Hamilton, is that there are two extremeness response sets, 
one at each pole, which are sufficiently different from one another 
to require separate summarizations. His own observations that the 
correlation between the measures at the two poles is substantially 
lower than the response set reliabilities suggest that this may very 
well be the case. 

That the tendencies to mark extreme responses at the positive and 
negative poles of a response continuum have substantially different 
characteristics was incidentally observed in three studies by the 
writer: (a) In the first (Gordon, 1952) three forms of a 150-item 
personality inventory, employing five-choice response format, were 
randomly administered to large samples of subjects, with the same 
30 items appearing in the first, middle and last sections on differ- 
ent forms. It was consistently found that the later the same socially 
undesirable items appeared in the questionnaire the greater the 
number of extreme negative endorsements they received. On the 
other hand, no counterpart increase in extreme positive endorse- 
ments of socially desirable items was noted. (b) In the second 
(Gordon, 19678) the validities of various scoring schemes for the 
Work Environment Preference Schedule (Gordon, 19684), a 5- 
point Likert scale designed to measure the “bureaucratic person- 
ality,” were compared. Responses at the disagreement end of the 
continuum were found to contribute differentially to scale variance, 
while those at the agreement end did not. The scoring scheme 
which differentially weighted extremeness responses only at the 
negative pole yielded the highest validities against relevant ex- 
ternal criteria. (c) In the third study (Gordon and Kikuchi, 1970), 
a school form of the Work Environment Preference Schedule in the 
original and in translation was administered respectively to coun- 
terpart American and Japanese high school students. In both cul- 
tures and for both sexes, significant differences were found be- 
tween the contributions of the response set scores at the negative 
and positive poles to total scale variance, that for the negative pole 
being substantially larger. 

The present study was conducted to further examine the validity 
of the extremeness response tendency at each of the two poles, 
employing two widely used research instruments, the California 
Authoritarianism or F-scale (Adorno, 1950) and the Dogmatism or 
D-scale (Rokeach, 1960). The design follows that of Korn and 
Giddan (1964) who used external measures of relevance to the 
construct validity of the Dogmatism scales as criteria. The expec- 
tation, based on prior findings, is that the two response set scores, 
one based on positive and one based on negative responses, will 
differentially contribute to scale validity and that the negative 
response set score will make the greater contribution. 


Procedure 
The California F-scale and Rokeach's D-scale each employ re- 
sponse alternatives representing three degrees of agreement and 
disagreement: “Very Strongly Agree” (VSA), “Strongly Agree” 
(SA), “Agree” (A), “Disagree” (D), “Strongly Disagree” (SD) 
. . . presented in the rows of Table 1. 

The C score is based on the conventional weighting scheme; the 
P score utilizes the dichotomous weights recommended by Cron- 
bach (1950), Peabody (1962) and Korn and Giddan (1964) . . . 
set score is computed by obtaining the weighted sum of the in- 
dividual’s responses at the appropriate half of the continuum and 
. . . and Independence as measured by the Survey of Interper- 
sonal Values or SIV (Gordon, 1960) and Variety and Order- 
liness as measured by the Survey of Personal Values or SPV . . . 
Authoritarianism and Dogmatism would be expected to place a high 
value on conformist behavior and on being systematic and orderly, 
and a low value on independence of action and on openness to new 
and varied experiences, as assessed respectively by the four value 
scales. 

The F-scale, D-scale, SIV and SPV were administered to a 
sample of 212 students primarily in the upper grades of a university 
high school, and product moment correlations among the several 
scores were obtained. 


Results 


A positive correlation between the X and Y scores serves as 
evidence for the existence of an extremeness response set since in 
unidirectional scales more extreme responses in one direction would 
be expected to be associated with less extreme responses in the 
other if content were the only consideration. In the present instance, 
the correlations between the X and Y scores were .48 and .63 for 
the Authoritarianism and Dogmatism scales respectively, reflecting 
a decided extremeness response set tendency on the part of these 
subjects. 

Product moment correlations of the X and Y scores with the C 
and P scores for both the Authoritarianism and Dogmatism scales 
are presented in Table 2. It will be noted that the X and Y scores 
are not independent of scale content, but in all instances are sig- 
nificantly related to the C and P scores of their respective instru- 
ments. That the relationships with the P score are significant is of 
interest since the latter score is designed to be extremeness-response- 
set free.² That the correlations with the C score are higher than 
those with the P score would be expected since the C score gives 
added weight to extreme responses. The relationships of the Y 
scores with the total scale scores are somewhat higher than those of 
the corresponding X scores; however, none of the differences are 
statistically significant. 

TABLE 2 
Correlations of the X and Y Scores with the C and P Scores of the 
Authoritarianism and Dogmatism Scales 

         Authoritarianism          Dogmatism 
           C         P            C         P 
    X    .96**     .14*         . . .     .16* 
    Y   -.30**    -.25**        . . .     . . . 

     * p < .05. 
    ** p < .01. 

² The confounding effects of content and style . . . Hamilton (1968) 
is illustrated here. 

Product moment correlations of the several types of scale scores 
(C, P, H, and L) and two response set scores (X and Y) with the 
four forced-choice criterion scales are presented in Table 3. It will 
be noted that the four Authoritarianism scores are significantly 
related in a positive direction to Conformity and Orderliness and 
in a negative direction to Independence and Variety. The correla- 
tions for the four Dogmatism scores are directionally identical, 
but of uniformly smaller magnitude. All relationships are in the 
hypothesized direction. 

A comparison of the relationships of X and Y response set scores 
with each of the criterion scales reveals that in seven of the eight 
cases the absolute value of the correlation of the Y score is the 
higher. However, more important, in all eight instances the signs 
of the Y score correlations are directionally congruent with both 
theoretical expectation and the obtained F or D scale validities, 
while in six out of eight instances the signs of the X score corre- 
lations are in directions opposite to what would be anticipated on 
these bases. For example, individuals who are the more extreme 
when agreeing (X) with Authoritarianism items would be expected 
to score lower in Independence, yet the obtained correlation is 
positive in sign (.09); those who are the more extreme when dis- 
agreeing with Authoritarianism items (those with higher Y scores) 
would be expected to score higher in Independence, and the rela- 
tionship is positive (.19). 

While the difference in directional congruence of the X and Y 
score validities is pronounced, an overall test of significance is pre- 
cluded since the 16 response set coefficients are not independent. 
However, the differential influence of the X and Y response sets on 
scale validity may be assessed indirectly by comparing the validities 
of the H and L scores, since each of the latter scores is weighted so 
as to capitalize on response set variance at its respective end of the 
continuum. It will be noted (Table 3) that in the two instances 
where the validities of the X scores and the scale scores are con- 
gruent in sign (on Authoritarianism, with Conformity and Order- 
liness), validities of the H and L scores are not significantly 
different. Out of the six remaining instances where the signs of the 
validities of the X score and the scale scores are not congruent, in 
four cases the validities of the L scores are significantly higher 
(p < .01) than those of the corresponding H scores.³ 
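
The comparison of two validities obtained on the same sample rests on a test for correlated correlations. The sketch below uses the commonly cited form of Hotelling's t statistic for two correlations that share one variable, with n - 3 degrees of freedom; the function name and the correlation values are hypothetical, and only the sample size of 212 is taken from the text.

```python
import math


def hotelling_t(r_xy, r_zy, r_xz, n):
    """Hotelling's t for the difference between two dependent correlations.

    r_xy and r_zy are the correlations of two predictors with the same
    criterion y; r_xz is the correlation between the two predictors.
    The statistic is referred to a t distribution with n - 3 df.
    """
    # Determinant of the 3 x 3 correlation matrix.
    det = 1.0 - r_xy ** 2 - r_zy ** 2 - r_xz ** 2 + 2.0 * r_xy * r_zy * r_xz
    return (r_xy - r_zy) * math.sqrt((n - 3) * (1.0 + r_xz) / (2.0 * det))


# Hypothetical validities for two scale scores against one criterion.
n = 212
t_value = hotelling_t(r_xy=0.40, r_zy=0.20, r_xz=0.50, n=n)
print(t_value, n - 3)
```

A positive value indicates that the first validity is the larger; reversing the order of the two correlations changes only the sign of the statistic.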

Discussion 

The results of the present study indicate that the scoring scheme 
that included only negative response set variance on the whole 
yielded significantly higher validities than that which included 
only positive response set variance. Thus, it may be inferred that 
the extremeness response sets at the two ends of the continuum 
differentially contribute to scale validity. One other study spe- 
cifically designed to test this latter hypothesis was noted in the lit- 
erature. Mitzel, Rabinowitz and Ostreicher (1956), using the Min- 
nesota Teacher Attitude Inventory (MTAI) as their research 
instrument, supervisory ratings of teachers as the criterion, and 
response set measures statistically identical to the X and Y scores, 
observed that only the negative response set had significant valid- 
ity. "The negative intensity response set was found to influence the 
test scores in such a way that test validity is increased by its pres- 
ence. Positive intensity . . . was found to exert very little effect on 
MTAI validity" (p. 514). 


That there are two functionally different extremeness response 
sets is supported by the weight of available evidence based on ex- 
ternal validation, internal analysis, and correlations between the 
response set measures themselves. Thus, whether extremeness re- 
sponse sets are studied in their own right or to assess their effect on 
scale validity, separate analyses would certainly appear to be 
called for. Failure to do so may preclude the making of proper in- 
ferences regarding their characteristics or their influence. 

³ Hotelling's test of significance for correlated correlations was employed 
(Guilford, 1965). 

PREDICTION OF INDIVIDUAL STABILITY¹ 


GEORGE V. C. PARKER 
University of Texas at Austin 


IN the history of the study of individual differences, one con- 
sistent observation has been that many organisms, human Ss in 
particular, are not consistent in their responses to repeated pre- 
sentations of the same situations. The behavioral instability phe- 
nomenon has been rather problematic from both a theoretical and 
a practical standpoint. With his concept of behavioral oscillation, 
for example, Hull (1952) explicitly and formally recognized the 
theoretical significance of intraindividual variability but experi- 
menters have typically averaged their measures, thus avoiding a 
direct confrontation with the issue of within-subject variability. 
In the area of test construction, response instability has tradition- 
ally been categorized as a kind of nuisance “error” to be minimized 
in order to achieve test reliability. Mischel's (1968) review of the 
assessment literature suggests that attempts to demonstrate re- 
liability of “traits” via personality tests has not been entirely 
successful; he found substantial reliability only for tests of intel- 
lectual and cognitive style. Explanations for response inconsistency 
are many and have been with us for some time. For example, 35 
years ago, Lentz (1934) attributed test response variability to such 
vicissitudinous characteristics as S indifference, as well as defi- 
ciencies in areas such as sincerity, sympathy, and appreciation of 
scientific method. While Lentz is doubtless not the first psy- 
chologist who has charged Ss with not caring, it is equally certain 
that our explanatory wherewithal is not greatly improved today. 
For example, Block (1968) has recently examined some of the psy- 
chometric and psychological reasons for behavioral inconsistency. 
While rather gloomily noting the lack of inspiriting empirical evi- 
dence for personality consistency, his speculations about the causes 
of inconsistency are not greatly different from Lentz's. 

¹ This study was supported in part by a grant from the . . . 
Institute of The University of Texas at Austin . . . 

An alternative approach to the problem of intraindividual
variability/stability is to consider it as a lawful phenomenon open
to systematic investigation in its own right. Data from available
studies, such as Fiske (1957), Berdie (1961), Baumeister and
Kellas (1968), Carrigan (1963), and Worell (1963) suggest that
this approach may be fruitful. Indeed, it seems likely that both
theoretical and experimental understanding of the phenomenon of
individual differences in stability could improve the validity of
the prediction of the behavior of individuals. The present study,
therefore, had the primary objective of evaluating further the
usefulness of conceptualizing intraindividual stability as a
personality construct. This was done by assessing the feasibility of
developing a psychometric scale which might be used to predict
reliably individual stability in self-description, self-concept, and
other behavioral domains. Secondary objectives were to explore other
concomitants of the scale.


Method 


Development of the Stability scale (Stb) was based on the
Gough Adjective Check List (ACL), which has shown considerable
[...] instrument (Gough and Heilbrun, 1965; Parker, [...]).

One hundred forty male and 140 female undergraduates volunteering
from introductory psychology classes were given the ACL under
standard conditions a total of three times. Following the ACL
administration, the procedure was repeated twice at two-week
intervals, providing three ACL protocols in a test-retest design.


Intraindividual Stability Criterion Measure

After scoring the protocols for the 24 standard ACL scales
(Gough and Heilbrun, 1965), the variance of each scale score was
computed for each S. The average of these scale variance scores for
each S furnished his overall index of self-descriptive stability. For
the sexes separately, Ss were grouped into quartiles according to
the distribution of stability indices, providing high- and
low-stability criterion groups for ACL item-analysis of endorsement
differences by z tests (McNemar, 1962, p. 60). The study of patterns
of self-description associated with stability was undertaken by
sexes separately because earlier work (Parker, 1969) has shown
systematic sex-related differences in patterns of ACL item
endorsement.
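
The criterion measure just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function names are hypothetical, the example scores are invented, and
the sample (n - 1) variance is assumed because the text does not say
which variance formula was used.

```python
from statistics import mean, variance


def stability_index(protocols):
    """Average within-subject variance across scales.

    protocols: one list of scale scores per administration, e.g. three
    lists of 24 ACL scale scores. A smaller index means more stable
    self-description. Sample variance (n - 1) is an assumption.
    """
    n_scales = len(protocols[0])
    per_scale_variance = [
        variance([administration[s] for administration in protocols])
        for s in range(n_scales)
    ]
    return mean(per_scale_variance)


def quartile_groups(indices):
    """Split subjects into criterion groups by stability index.

    indices: dict mapping subject id -> stability index.
    Returns (high_stability_ids, low_stability_ids), i.e. the quarter
    with the smallest indices and the quarter with the largest.
    """
    ranked = sorted(indices, key=indices.get)
    quarter = len(ranked) // 4
    return ranked[:quarter], ranked[-quarter:]


# Hypothetical subject with two scales and three administrations.
example = [[50, 50], [50, 52], [50, 54]]
index = stability_index(example)
```

In the study itself the grouping was carried out for each sex
separately, so `quartile_groups` would be called once per sex.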
Results
For male Ss, Table 1 lists the adjectives for which stability 
criterion groups differed in rates of endorsement, These items com- 


TABLE 1
ACL Stability Scale Adjectives for Males*
Indicative Adjectives (37 Items)

             Proportion of   Proportion of                Proportion of   Proportion of
             High-Stability  Low-Stability                High-Stability  Low-Stability
             Males           Males                        Males           Males
Adjective    Endorsing Item  Endorsing Item   Adjective   Endorsing Item  Endorsing Item
alert 9 63 pleasant 86 2 
Attractive 0 34 poised 49 
dear-thinking 94 66 practical м 2 
dever 74 49 progressive s з 
confident 80 46 quick п s 
conscientious 8з 57 rational s п 
determined 89 49 reasonable 100 3 
dignified 63 34 relaxed п ө 
dominant 49 23 reliable н A 
enterprising ^ "Án E a 
63 31 
foresighted 69 37 sharp-witted ы 8 
individualistic 853 57 ly 46 
a 77 40 stable 50 = 
80 26 tactful 80 % 
ge E Nu tolerant A 49 
cr a Hi versatile 80 5 



TABLE 1 (Continued)

ACL Stability Scale Adjectives for Males*
Contraindicative Adjectives (18 Items)

               Proportion of High-   Proportion of Low-
               Stability Males       Stability Males
Adjective      Endorsing Item        Endorsing Item
awkward              03                    34
careless             09                    43
coarse               03                    19
complaining          11                    40
confused             11                    51
dreamy               23                    51
emotional            40                    69
forgetful            16                    51
fussy                03                    31
gloomy               06                    29
impulsive            17                    54
lazy                 17                    57
moody                40                    74
nervous              23                    54
shy                  29                    54
slow                 00                    37
sulky                06                    26

* Differences significant p < .05.


prise the ACL Stb scale for males. Because there are 37 high- 
stability and 18 low-stability items, the ACL Stb scale scoring was 
organized in the indicative-minus-contraindicative manner typical 
of most other ACL scales. Thus, an individual's ACL Stb scale 
Raw Score = Σ High-Stability items minus Σ Low-Stability items.
The average Stb raw score for an independent sample of 35 high-
stability males = 19.7; for 35 low-stability males = 6.7; F = 55.8;
df = 1,68, p < .01. This compares closely to the data from the


original sample males: high-stability = 20.2, N = 35; low-stability 
= 6.9, N = 35; Р 
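
The indicative-minus-contraindicative scoring rule can be illustrated
with a short Python sketch. The function name is hypothetical, and the
two item sets below are small subsets of the male adjectives in Table 1
chosen only for illustration; the full scale uses all listed items.

```python
# Illustrative subsets of the male Stb items from Table 1 (not the full scale).
INDICATIVE = {"alert", "confident", "determined", "reasonable"}
CONTRAINDICATIVE = {"awkward", "careless", "lazy", "moody"}


def stb_raw_score(endorsed, indicative=INDICATIVE, contraindicative=CONTRAINDICATIVE):
    """Raw score = number of endorsed indicative items
    minus number of endorsed contraindicative items."""
    endorsed = set(endorsed)
    return len(endorsed & indicative) - len(endorsed & contraindicative)


# Hypothetical protocol: two indicative items, one contraindicative item,
# and one adjective that is not on the scale.
score = stb_raw_score(["alert", "confident", "lazy", "friendly"])
```

Adjectives that belong to neither set do not affect the score, which is
why the total number of adjectives endorsed is handled separately when
raw scores are converted to standard scores.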


For female Ss, T[...] high-stability females = 5.3; for [...]
F = 13.8, df = 1,72, p < .01. This compares well with the data from
the original sample females: high-stability = 9.3, N = 35;
low-stability = 5.0, N = 35; F = 16.7, df = 1,68, p < .01.


TABLE 2
ACL Stability Scale Adjectives for Females*
Indicative Adjectives (19 Items)

               Proportion of   Proportion of                 Proportion of   Proportion of
               High-Stability  Low-Stability                 High-Stability  Low-Stability
               Females         Females                       Females         Females
Adjective      Endorsing Item  Endorsing Item   Adjective    Endorsing Item  Endorsing Item
active               86              58         initiative        5[?]            31
alert                94              61         logical           83              53
appreciative        100              78         mature            74              33
efficient            86              50         outgoing          66              28
energetic            74              39         peaceable         86              58
fair-minded          94              72         precise           40              [?]
feminine             86              56         quick             43              17
forgiving            94              72         resourceful       49              22
generous             80              56         thorough          63              25
humorous             89              67

Contraindicative Adjectives (10 Items)

awkward              20              44         immature          17              47
conceited            06              31         inhibited         [?]             [?]
cynical              14              42         lazy              [?]             [?]
disorderly           09              31         moody             60              83
egotistical          11              36         preoccupied       11              36

* Differences significant p < .05.


A test-retest procedure, using an independent sample of 100 male
and 100 female undergraduates, who were given the ACL twice
under standard conditions, with an interval of three months,
yielded acceptably high reliability coefficients for the Stb scale
items: males = .81, females = .78. An interesting question which
was addressed with this sample is whether it is sensible to try to
develop a Stb scale, the reliable responses to which would predict
instability (unreliability) in behavior. Consequently, the question
whether Stb reliability is related to Stb score was examined by
comparing the test-retest Stb reliability coefficients of high- and
low-stability Ss (the upper and lower quartiles of Stb scores). For
the high-stability males (N = 25), the Stb reliability coefficient
was .82, compared to .78 for the 25 low-stability males (Z = 0.37,
p > .05). A similar comparison for female Ss yielded a Stb
reliability coefficient of .80 for 25 high-stability vs. .76 for 25
low-stability females (Z = 0.34, p > .05). From this it can be
concluded that the ACL Stb scale is equally reliable for high- and
low-stability Ss.
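
The text does not name the test used to compare the two reliability
coefficients. As a sketch, assuming the usual Fisher r-to-z
transformation for two independent correlations (an assumption, though
it reproduces the reported Z values), the comparison can be written as
follows; the function name is hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Z statistic for the difference between two independent
    correlations, using Fisher's r-to-z transformation."""
    z1 = math.atanh(r1)  # Fisher transform of the first coefficient
    z2 = math.atanh(r2)  # Fisher transform of the second coefficient
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# Reliability coefficients and group sizes as reported in the text.
z_males = fisher_z_difference(0.82, 25, 0.78, 25)
z_females = fisher_z_difference(0.80, 25, 0.76, 25)
print(round(z_males, 2), round(z_females, 2))  # 0.37 0.34
```

Both values fall far below the 1.96 needed for two-tailed significance
at the .05 level, in line with the conclusion drawn above.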
Because in the ACL format scale scores are correlated positively
with total number of adjectives endorsed, a correction factor must
be introduced. The procedure used by Gough and Heilbrun (cf.
Gough and Heilbrun, 1965) was followed identically in this study.
Four-category standard score conversion tables were calculated
for the ACL Stb scale, based upon a sample of 2,212 females and
2,805 males. Tables 3 and 4 present the standard score conversion
data. Fundamentally, a high Stb scale score indicates a tendency to
endorse ACL items in the direction of Ss who are consistent in
self-description, while a low Stb scale score indicates a tendency to
respond to the ACL items in a direction indicative of inconsistency in
self-description.
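
A conversion of this kind amounts to a two-way table lookup: the total
number of adjectives endorsed selects a column, and the raw score
selects a row. The sketch below is illustrative only. The names are
hypothetical, the column boundaries are those of the male table
(Table 3), and only three legible rows of that table are included.

```python
from bisect import bisect_left

# Upper bounds of the four "total adjectives endorsed" categories for males:
# 1-75, 76-95, 96-121, 122-300.
UPPER_BOUNDS = [75, 95, 121, 300]

# Three rows taken from Table 3 (raw score -> standard score per category).
MALE_TABLE = {
    5: (47, 41, 37, 33),
    10: (55, 47, 43, 40),
    15: (63, 54, 49, 46),
}


def standard_score(raw, total_endorsed, table=MALE_TABLE):
    """Look up the standard score for a raw Stb score, corrected for
    the total number of adjectives endorsed."""
    if not 1 <= total_endorsed <= UPPER_BOUNDS[-1]:
        raise ValueError("total_endorsed must lie between 1 and 300")
    column = bisect_left(UPPER_BOUNDS, total_endorsed)
    return table[raw][column]


print(standard_score(10, 100))  # 43
```

The same raw score of 10 converts to 55 for a S who endorsed 75 or
fewer adjectives and to 40 for one who endorsed 122 or more, which is
the correction the paragraph above describes.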

Summarized in Table 5 are the statistically significant
correlations between the Stb scale and other ACL scales as well as
several widely used psychometric scales from the Minnesota Multiphasic
Personality Inventory (MMPI). The Stb scale shows significant
positive relations with those ACL scales which are associated
typically with good personal adjustment, particularly in a college
sample, e.g., Achievement, Self-Confidence, Endurance (cf. Heilbrun,
1960, 1961 a, b). At the same time, it bears negative relationship
to scales related to poor emotional adjustment, e.g., Succorance
and Counseling Readiness of the ACL as well as several clinical
scales of the MMPI.

Discussion

Inspection of the adjectives contained in Tables 1 and 2 shows
that those items associated with high-stability describe very
attractive, mature, and socially-desirable characteristics. On the
other hand, the self-attributed qualities of low-stability Ss (Stb
contraindicative items) are quite uncomplimentary, self-critical, and
socially-undesirable. In fact, the overlap between these groups of
items and the ACL Favorable Items scale (Fav) and Unfavorable
Items scale (Unf) is considerable. For male Ss, 23 of the 37
high-stability adjectives appear on the ACL Fav, while seven of the 18
low-stability items appear on the ACL Unf. Similarly, for female
Ss, 11 of the 19 high-stability items are part of the ACL Fav, and
six of the 10 low-stability adjectives are contained in the ACL
Unf. Basically what this means, then, is that Ss who said generally
favorable things about themselves on the first ACL administration
tended to be quite stable in self-description [...] administrations,
while those who described [...] much less stable in self-description
[...]


TABLE 3
Males: Conversion of Raw to Standard Scores for ACL Stability Scale

           Total Number of Adjectives Endorsed
           1-75    76-95    96-121    122-300
Raw               Standard Scores
-25                  1         1
-24          1       2         2
-23          2       3         3
-22          4       5         4
-21          5       6         5         1
-20          7       7         7         2
-19          9       9         8         3
-18         10      10         9         4
-17         12      11        10         6
-16         13      13        11         7
-15         15      14        13         8
-14         17      15        14         9
-13         18      17        15        10
-12         20      18        16        12
-11         21      19        17        13
-10         23      21        19        [?]
 -9         25      22        20        15
 -8         26      23        21        17
 -7         28      25        22        18
 -6         30      26        23        19
 -5         31      27        25        20
 -4         33      29        26        [?]
 -3         34      30        27        [?]
 -2         36      31        28        [?]
 -1         38      33        30        26
  0         39      34        31        [?]
  1         41      35        [?]       [?]
  2         42      37        33        30
  3         44      38        34        31
  4         46      39        36        [?]
  5         47      41        37        33
  6         49      42        38        35
  7         50      43        39        36
  8         52      45        40        37
  9         54      46        42        38
 10         55      47        43        40
 11         57      49        44        41
 12         58      50        45        42
 13         60      51        46        43
 14         62      53        48        [?]
 15         63      54        49        46
 16         65      55        50        47
 17         67      57        51        48
 18         68      58        52        49
 19         70      59        54        50
 20         [?]     61        55        52
 21         73      62        [?]       [?]

[Remaining rows of Table 3 are not recoverable.]
TABLE 4

Females: Conversion of Raw to Standard Scores for ACL Stability Scale

Total Number of Adjectives Endorsed: 1-78, 79-98, 99-119, [...]

[Standard score entries are not recoverable; the surviving raw scores run from -13 to -1.]
TABLE 5
Intercorrelations between Stability Scale and Other Measuresᵃ

ACL Scales             rᵇ      ACL Scales              r
Defensiveness          58ᶜ     Unfavorable            -56
Favorable              56      Succorance             -46
Self-Confidence        65      Abasement              -36
Self-Control           39      Counseling Readiness   -41
Personal Adjustment    59
Achievement            73      MMPI Scales
Dominance              70      F                      -31
Exhibition             26      [?]                    -27
Endurance              51      D30ᵈ                   -40
Order                  47      MAS                    -36
Intraception           42      Pt                     -36
Nurturance             38      Sc                     -32
Affiliation            48      Si                     -34
Heterosexuality        36      Pd                     -23

ᵃ Based on N = 131 undergraduates (66 males and 70 females).
ᵇ All correlations significant p < .01.
ᶜ Decimals omitted.
ᵈ From Dempsey (1964).

It is worth noting in this context that there are sex differences in 
ACL items that differentiate between high- and low-stability Ss. 
There are only five high-stability items for males and females in 
common; they are alert, efficient, initiative, quick, and thorough. 
For the low-stability adjectives, only three were common to both 
male and female Ss; they are awkward, lazy, and moody. Why 
there should be so few items in common across sex is not
altogether clear, although a partial explanation may be that a
considerable proportion of the remaining adjectives are related to
sex-specific endorsement tendencies. For example, 10 of the 37
high-stability items for males are included as masculine items in the
ACL Femininity scale (Parker, 1969).

In addition to these data and the obvious face-validity that the 
high-stability adjectives (Tables 1 and 2) have for indicating nor- 
mal or good adjustment, there are further data to suggest a positive 
relationship between stability and adjustment. Using the ACL, 
Goodman and Mendelsohn (1969) recently surveyed psychothera- 
pists nationally to obtain information about therapists’ values and 
attitudes toward patients (males) they treat. Comparison of the
male high-stability adjectives (Table 1) with therapists’
perceptions of the adult male who has a satisfactory adaptation to
himself and his environment (normal adult male) reveals a substantial
overlap. Specifically, adjectives in common are alert, clear-thinking, 
confident, conscientious, reasonable, reliable, responsible, stable, 
tactful, and tolerant. Goodman and Mendelsohn provide no data 
on females for comparison. However, the clear suggestion from 
these data for males, and the overall ACL profiles for high- and low- 
stability Ss is that stability, as defined in this context, is related to 
level of personal adjustment. 

It could be argued that a person who ascribes certain undesir- 
able properties to himself, such as the low-stability characteristics 
disorderly, egotistical, immature, lazy, moody, sulky, and worry- 
ing, would, by definition, be suffering from personal problems or 
internal conflicts. In this sense, the results of this study are com- 
parable to those obtained by Worell (1962; 1963), who found that 
college students with high “intraindividual conflict,” as compared 
with less conflicted students, showed lesser stability in reaction 
time and discrimination responses. The fact that this relationship 
between stability and personality disturbance has been obtained 
with very different measures of stability and of maladjustment 
suggests that the phenomenon is a general one which merits further 
investigation. 

Theoretically, these findings are in accordance with the views of 
Baumeister and Kellas (1968), who view low-stability as a “gen- 
eralized expression of behavior pathology.” Erikson’s (1968) theo- 
retical views may also be related to the present findings. Erikson 
suggests that young people who have not yet achieved a stable 
sense of identity will experience any of a variety of emotional
difficulties. If instability in self-description may be taken as an
index of identity confusion, Erikson's views are supported by the
present data. The question of whether instability can always be taken
as an indication of maladjustment remains to be answered. There are, of
course, situations in which flexibility of response is adaptive.
Future investigations might determine which specific types of stability
are related to various measures of maladjustment, and which are
not. Moreover, in addition to the adjustive implications of the 
Stb scale, it may contribute to the understanding of personality 
concomitants of behavioral dimensions such as rigidity of
attitudes, interpersonal evaluations, or perceptual stability.


A ONE-STEP NOMOGRAPH FOR THE
KOLMOGOROV-SMIRNOV TEST


M. REEB 
Bar Ilan University, Israel 


Herrick (1969) has given a nomograph for finding the two-tailed
Kolmogorov-Smirnov statistic D for p = .05. He uses a log
transformation, plotting D as the vertical axis against the smaller
sample size as the horizontal, for various ratios of the two sample
sizes, from 1.0 (equal sample sizes) to 0 (one-sample case). The
nomograph given below is more convenient to use in that no calculation
at all, and only one setting of a ruler, are required, and the scales
are simpler to use. More information is also obtained in that both .05
and .01 levels of significance are given and most cases of small sample
size are covered.


Rationale 
Two-Sample Case 


The large-sample critical value of the K-S statistic D (Siegel,
1956) is given for the two-tailed test by

$$D \ge k \sqrt{\frac{n_1 + n_2}{n_1 n_2}} \tag{1}$$

where D is the largest difference between the two cumulated
distribution functions of n₁ and n₂ cases at any one point, and k is a
constant depending on the level of significance previously selected
to reject the null hypothesis.

Now, in a right-angled triangle of lesser sides n₁ and n₂, if a line be
drawn bisecting the right angle to meet the hypotenuse, and its
length is d, then

$$d = \sqrt{2}\,\frac{n_1 n_2}{n_1 + n_2} \tag{2}$$

Combining (1) and (2) we obtain

$$D \ge k \sqrt{\frac{\sqrt{2}}{d}} \tag{3}$$

If now we draw a nomograph with n₁ and n₂ as vertical and
horizontal axes respectively then D, the required minimum
significant difference, for a particular value of n₁ and n₂, is a simple
function of k and of d, the distance from the origin along the
diagonal bisecting the right angle between the axes. The diagonal
is, therefore, calibrated by calculating, for each D, the critical
distance given, from (3), by

$$d = \frac{\sqrt{2}\,k^2}{D^2} \tag{4}$$
One-Sample Case

Here, as in Siegel (op. cit.),

$$D \ge \frac{k}{\sqrt{n}} \tag{5}$$

which corresponds to (1) when n₁ = n and n₂ → ∞. Thus in the
nomograph the line “between n₁ and n₂” becomes a line from n₁
parallel to the horizontal axis, intersecting the diagonal at D. Or
else, using the same scheme as above,

$$D \ge \frac{k}{\sqrt{d'}} \quad \text{and} \quad d' = \frac{k^2}{D^2} \tag{6}$$

where d' is measured along the vertical axis from the origin.


Construction of the Nomograph


In drawing the nomograph (Figure 1), the distances from the
origin are calculated in terms of the required Ds by means of
equations (4) and (6). Values of k were calculated to four figures
by interpolation from Smirnov (1948), and give for the two-sample
case, d = 2.6080/D² for the .05 and 3.7482/D² for the .01 levels of
significance, and for the one-sample case, d' = 1.8442/D² for .05 and
2.6504/D² for .01.
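
The relations between equations (1) to (6) and the calibration
constants can be checked numerically. The Python sketch below is
illustrative only: the function names are hypothetical, and the values
of k² are read off the one-sample constants quoted above (1.8442 and
2.6504). It applies only to the large-sample approximation.

```python
import math

# k squared for each significance level, taken from the one-sample
# calibration constants quoted in the text.
K_SQUARED = {0.05: 1.8442, 0.01: 2.6504}


def critical_d_two_sample(n1, n2, alpha=0.05):
    """Equation (1): large-sample critical D for two samples."""
    return math.sqrt(K_SQUARED[alpha] * (n1 + n2) / (n1 * n2))


def critical_d_one_sample(n, alpha=0.05):
    """Equation (5): large-sample critical D for one sample."""
    return math.sqrt(K_SQUARED[alpha] / n)


def diagonal_distance(n1, n2):
    """Equation (2): length of the bisector of the right angle."""
    return math.sqrt(2) * n1 * n2 / (n1 + n2)


def critical_d_from_diagonal(d, alpha=0.05):
    """Equation (3): critical D read from the distance d on the diagonal."""
    return math.sqrt(math.sqrt(2) * K_SQUARED[alpha] / d)


# Reading the diagonal gives the same D as equation (1).
direct = critical_d_two_sample(50, 100)
via_nomograph = critical_d_from_diagonal(diagonal_distance(50, 100))
```

Multiplying each k² by √2 gives approximately 2.6081 and 3.7482, which
agree with the two-sample constants quoted above to within rounding.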

These functions were calculated at intervals of D of 0.01 for the
limits given below and drawn within practical bounds of
discriminability for the scale of the ns chosen. Results for the two
significance levels were drawn on opposite sides of the diagonal. For
convenience, the one-sample D calibration was done on a line parallel
to the n₁ axis, and to the left of it.

[Figure 1. Nomograph for the Kolmogorov-Smirnov statistic D (two-tailed test).]


Sample Size Limitations 


For both the two- and one-sample cases, for large numbers of
cases (n₁, n₂ > 40), the only upper limitation of size is the scale of
the nomograph. The largest value of D calculated corresponds to
the highest value of n₁ and n₂.

For smaller samples, for the two-sample case where n₁ = n₂, the
values obtained from (1) are the same as those given in Table L in
Siegel (op. cit.) down to eight cases, and the nomograph is, therefore,
applicable to this limit. For the one-sample case, the values in
Table E of Siegel are not the same as those given by (5), and the
nomograph is, therefore, calibrated by graphical interpolation from the
values of Table E. (The reason for this asymmetry seems to be the
necessarily “unequal sample size” when there is only one sample.)


Use of the Nomograph 


For the two-sample case, set a ruler at the appropriate values of
n₁ and n₂; it will cut the diagonal at the critical minimal D for
significance at the .05 and .01 levels, as marked.

For the one-sample case, set the ruler at the appropriate n value
horizontally; it will cut the vertical line on the left at the critical
minimal D for significance at .05 and .01.

The nomograph can be used, for both two- and one-sample cases,
for all values of n₁, n₂ > 40. For fewer cases, it can be used for the
one-sample case without limitation, and for the two-sample case,
only if n₁ = n₂ ≥ 8.


POSTEXPERIMENTAL ASSESSMENT OF AWARENESS
IN ATTITUDE CONDITIONING


MONTE M. PAGE 
University of Nebraska at Lincoln 


In the recent attitude change literature there has arisen a
controversy (Cohen, 1964; Insko and Oakes, 1966; Page, 1969; Staats,
1969) regarding the interpretation of a well known laboratory
experiment intended to classically condition attitudinal affect to
previously neutral stimuli. The original authors (Staats and Staats,
1958) claimed they had demonstrated that attitudes are acquired
through a process similar to classical conditioning, and that this
occurred “without awareness-without cognition” on the part of the
subjects. The critics of this interpretation have asserted that some
subjects in this situation do become aware and that the experimental
effect can be accounted for in terms of this awareness. In his recent
reply to his critics, Staats (1969) still asserts that his original
interpretation of the study was correct. He claims that his single
[question] approach to assessing awareness postexperimentally was of
[...] validity, and that those who have used more elaborate
questionnaires in assessing awareness have only succeeded in
suggesting awareness after the fact to subjects who were actually
conditioned without awareness.

[...] that the crux of this debate boils down to the relative
[validity of single-question] versus multiple-question awareness
measures, [Staats'] assertion that multiple-question techniques of
assessing awareness elicit postexperimental reports of awareness in
subjects who weren't really aware is an often stated objection to the
awareness position by those committed to a conditioning theory of
both verbal operant and verbal classical conditioning experiments.
This problem is an obvious possibility inherent in the methodological
necessity of assessing the awareness postexperimentally. However,
there is evidence in verbal operant conditioning that the awareness
occurs during and not after the conditioning procedure (DeNike,
1964; Page and Lumia, 1968). In evoking this argument regarding
the present author's (Page, 1969) awareness results, Staats
overlooked an important factor. An attempt was made in that study to
control for postexperimental suggestion by including in the
postexperimental questionnaire items regarding the time at which
subjects became aware. Subjects who did not clearly report that their
awareness occurred during the experiment and prior to the marking of
the attitude scales were not counted as aware. In effect, Staats is
saying that a large proportion of the subjects were dishonest regarding
reports of the timing of their awareness. There is no evidence that
this occurred. The fact that those and only those who accounted for
the significant effect verbalized demand awareness supports the view
that suggestion by the interview procedure cannot account for the
awareness. Why was awareness not suggested in the other subjects
but was only reported by enough subjects to account for the
conditioning effect and no more?

The purpose of the present study was to investigate data
discrepancies between previous studies of the role of awareness of
experimental demand characteristics (Orne, 1962) in attitude
conditioning by comparing postexperimental assessment techniques.
Staats and Staats (1957, 1958) asked a single open-ended question
regarding subjects’ thoughts about the purpose of the experiment. With
the small percentage of subjects judged aware removed from the data,
they found significance in the remaining data. In spite of a
substantial literature in verbal operant conditioning (Spielberger and
DeNike, 1966) which demonstrates the inadequacy of a single
open-ended question in detecting all the aware subjects, Staats
still defends the validity of his technique for assessing awareness.

The basic argument against a single open-ended question as a
measure of awareness is that, while the meaning of the question may
seem perfectly clear to its author, some subjects may misinterpret
the import of the question. This is supported by the present author's
observations in previous studies that some subjects’ answers to the
first general and open-ended question are completely irrelevant to
what is being asked. In addition, many subjects may understand the
question, but write such a brief or vague answer as to preclude
accurate classification. For example, in previous studies some subjects
have written simply “the purpose of the experiment was word
association,” then on the extended questionnaire gone on to explain that
they meant there was an association of pleasant and unpleasant
meaning for certain syllables. These subjects also stated that they
had believed they were to demonstrate the learning of this by
marking the syllables appropriately on the rating scales. Thus, the
single question technique, while detecting some awareness, is not very
accurate in its overall partitioning of awareness versus unawareness.
It would seem that the procedure of removing aware subjects
and analyzing the remaining data would require an accurate
measure of awareness. With a single open-ended question this
requirement is simply not met.

Insko and Oakes (1966) separated the concepts of contingency
awareness, by which they meant knowledge of the consistent
association between the affective words (beauty, sweet, pleasure, etc.)
presented verbally and a specific nonsense syllable on the visually
presented list, and demand awareness (Orne, 1962) or [awareness]
of the purpose of the experiment. Their measure of contingency
awareness seems straightforward and adequate. One would expect
this multiple-question technique to separate all of the contingency
awares from the unawares. But, since no questions are included
concerning the time and saliency of awareness, it might encourage
guessing and attempts to recall events that were not very salient
during the experiment. This assessment technique may identify more
subjects as aware than actually were, but there is little danger that
it would leave any awares undetected. Their 10-point scoring
system, however, seems arbitrary and could lead to some ambiguity.
For example, the only way a subject could receive a score of 1, 2, or
3 is to have made a number of incorrect guesses. There is no reason
to believe that a subject who made incorrect guesses is any more
unaware than a subject who did not guess and thus received a score
of five. Likewise, at the other end of the scale there is no reason to
believe that a subject who received a score of eight is any less aware
than one who scores 10. Failure of a subject to be scored [...]
on the first open-ended question may mean that he is less articulate
but does not necessarily indicate that he was less aware of the
contingency.

Insko and Oakes used a single open-ended question for assessing
demand awareness. The question was: “Did you feel as if you were
supposed to rate the nonsense syllables in any particular way? If 
so explain.” This seems to be a straightforward question, but it is 
subject to the same problems suggested regarding possible inaccura- 
cies in Staats’ technique for assessing awareness. What does it mean 
when a subject simply answers “no” to this question? It could mean 
that he did not interpret the question correctly or it could mean that 
he really was not demand aware. Since there are no other questions 
it is impossible to separate these two types of subjects. On the other 
hand, it would seem that a subject who says “yes” and then gives 
an incorrect explanation should be classified as clearly unaware 
rather than giving him some credit for being partially demand 
aware. The distinction between a subject who scores a two versus 
one who scores three seems also to lead to ambiguity. A subject who 
says “according to the way they were grouped” (Insko and Oakes, 
1966) could be just as aware of the demand characteristics as one 
who gives a more specific answer, but since there are no more ques- 
tions this could not be determined. What is being measured here is 
clarity of expression rather than demand awareness. There is thus 
the strong possibility that their measure and scoring procedure would 
lead to both false positives and false negatives, reducing the validity 
of the measure and hence, any correlation with conditioning. 
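
The claim that misclassification lowers any correlation with
conditioning can be made concrete with a small numerical sketch. All
values below are hypothetical and chosen only for illustration; they
are not data from any of the studies discussed.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical conditioning scores for eight subjects.
conditioning = [3, 4, 3, 4, 0, 1, 0, 1]

# Hypothetical true awareness status (1 = aware, 0 = unaware).
true_awareness = [1, 1, 1, 1, 0, 0, 0, 0]

# The same subjects as scored by an imperfect measure: the first subject
# is a false negative and the fifth is a false positive.
measured_awareness = [0, 1, 1, 1, 1, 0, 0, 0]

r_true = pearson_r(true_awareness, conditioning)
r_measured = pearson_r(measured_awareness, conditioning)
print(round(r_true, 2), round(r_measured, 2))  # 0.95 0.47
```

With only two of eight subjects misclassified, the observed
correlation in this invented example drops to about half its true
value, which is the kind of attenuation the argument above anticipates.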

The present author assumed that his more elaborate technique, 
utilizing multiple and converging questions for assessing demand 
awareness, correctly classified more subjects and was more valid. 
The author's measure did, in fact, account for all the significant 
conditioning effect (Page, 1969) while Insko and Oakes' (1966) 
measure of demand awareness did not. The basic issue in resolving 
these contradictory results is the question of which assessment tech- 
nique is more valid. There were enough procedural differences be- 
tween the original studies, however, so that the difference in results 
cannot be clearly attributed to questionnaire differences without 
further evidence. In the present study, the correlation between
awareness and attitude conditioning was investigated as a function
of the technique of assessment of awareness, in the context of the
same study where all subjects were given all three questionnaires.
To check on possible problems which may have arisen because of 
the necessity of not counterbalancing in the repeated measures 
design, the study was then repeated on three new groups of subjects 
where each group received only one questionnaire.


Subjects


Subjects were 160 introductory psychology students at the
University of Nebraska at Lincoln. They were run in groups varying
in size from 15 to 30 each. Data were collected at approximately the
middle of the university semester.


Method 


All subjects were given the attitude conditioning procedure
described in detail elsewhere (Page, 1969; Staats and Staats, 1957,
1958). There were four visually-presented nonsense syllables: wuh,
yof, laj, and giw; and 18 conditioning trials. Wuh was paired with
spoken words having strong evaluative meaning; the other three
were paired with neutral words. Half (N = 80) of the subjects had
positive evaluative meaning paired with wuh; the other half (N =
80) received negative meaning associated with wuh. Following this,
subjects rated the four syllables (plus eight filler syllables not
previously encountered) on 9-point pleasant-unpleasant scales of the
semantic differential type. Then they circled words on a dittoed
sheet which they remembered as having appeared on the spoken
list.

Subjects then responded sequentially to three postexperimental 
questionnaires regarding awareness. First they were told to respond 
to the following question on the back of the sheet of paper used as 
the “second learning test”: “Would you write down anything you 
thought about the experiment, especially anything you thought 
about the purpose of the experiment while you were participating 
in the experiment.” This is the single open-ended question proposed 
by Staats (1969) and presumably something similar was used in his 
earlier studies. Papers were collected as subjects finished writing. 

When all papers were collected the next questionnaire was intro- 
duced by simply saying that the experimenter now wanted them to 
fill out a written questionnaire. Subjects then responded to this 
written version of the questionnaire used by Insko and Oakes 
(1966). Each question was on a separate page of a booklet and the 
booklets were collected as subjects finished the last page. Finally, 
subjects were given “honesty and conscientiousness” instructions 
(Page, 1968), and then asked to fill out a booklet containing the 
Page (1969) questionnaire. Added to this questionnaire as the sec- 
ond question were three 9-point rating scales asking subjects to rate 
their degree of attention, effort to learn all the words and boredom 
during the experiment. Otherwise the questionnaire was identical to 
that used previously. 

The questionnaires were always presented in the above order. 
This was considered a necessity, though it confounds effectiveness 
of the questionnaire with order of presentation, because the simpler 
open-ended questionnaires would not be comparable to earlier re- 
sults if they followed the more detailed and specific questionnaire. 
The possibility of the open-ended questions influencing the more 
detailed questionnaire was thought to be less important. Study II 
repeats the comparison of assessment techniques using a separate 
groups design which precludes the problem of one assessment tech- 
nique influencing the other. A repeated measures approach was ta- 
ken in this first study because the author wanted to explore cases 
where the same subject would be classified differently on the differ- 
ent measures of awareness. A direct comparison of the three assess- 
ment techniques on the same subjects seemed to be the best way of 
exploring differences between techniques. 

All questionnaires were scored blind by two judges working 
independently, according to the rules employed in the previous 
studies where each questionnaire was used. 


Results 


The first question concerns the reliability of scoring by the two 
independent judges on the various measures of awareness. Since 
awareness was conceived of as a functional dichotomy, each judge’s 
scoring was dichotomized and reliability was measured in terms of 
phi coefficients. For the Staats question, data were already in the 
aware-unaware form. For a subject to be classified aware on this 
question he had to state that wuh was associated with words of 
pleasant meaning (or unpleasant, depending upon the condition); 
otherwise he was scored unaware. This is essentially a measure of 
contingency awareness. Since few subjects clearly stated that they 
thought they were supposed to rate the syllable according to the 
association, no scoring for demand awareness was possible. 

For the Insko and Oakes contingency awareness measure, the 
data were dichotomized by considering a score of six or below as 
unaware and seven or above as aware. On the demand awareness 
measure the data were dichotomized between a score of one and 
two. For the Page measures of awareness, the data were dichoto- 
mized between a score of two and three on the 4-point scale of 
definitely unaware to definitely aware. 
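
The cut points just described can be restated compactly. The following Python sketch is illustrative only: the names `CUTS` and `is_aware` are hypothetical, and it assumes that higher scale scores indicate greater awareness on every measure, which the text implies but does not state outright for the Insko and Oakes demand scale.

```python
# Lowest score classified as "aware" on each measure (hypothetical names).
# Assumption: on every scale a higher score means more awareness.
CUTS = {
    "insko_oakes_contingency": 7,  # six or below unaware, seven or above aware
    "insko_oakes_demand": 2,       # dichotomized between one and two
    "page": 3,                     # between two and three on the 4-point scale
}


def is_aware(measure: str, score: int) -> bool:
    """Return True when a judge's scale score falls on the aware side of the cut."""
    return score >= CUTS[measure]


if __name__ == "__main__":
    for measure, score in [("insko_oakes_contingency", 6),
                           ("insko_oakes_demand", 2),
                           ("page", 3)]:
        print(measure, score, is_aware(measure, score))
```

The Staats question needs no cut point, since it was already scored directly as aware or unaware.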

Table 1 presents the resulting phi coefficients (over their appropri- 
ate phi-max values) for scorer reliability. Below each phi over phi- 
max is presented, in parentheses, the ratio of phi over phi-max 
which will aid the reader in comparing the relative strengths of the 
various associations. It may be seen that scorer reliability is no 
problem for the Staats questionnaire. The judges agreed almost 
perfectly as to how all subjects should be scored. The dichotomous 
scoring of the Insko and Oakes questionnaires results in lower relia- 
bilities than reported in the original study. While the judges seldom 
disagreed by more than one scale score, it happened that the cut-off 
points selected were the ones where judges disagreed the most. Par- 
ticularly on demand awareness, the judges had the most difficulty 
agreeing on whether a subject’s explanation of his “yes” was incor- 
rect or partially correct. When Pearsonian correlations were com- 
puted on the full scale scores, the reliability for contingency aware- 
ness was r = .93 and for demand awareness was r = .89. These 
figures are more comparable with the reliabilities reported by Insko 
and Oakes. It may be seen that the reliabilities for the Page aware- 
ness measures are adequate but not especially high. This is probably 
due to a failure on the part of the author to adequately train the 
other judge on the specific criteria for judging a subject to be aware. 
More of the disagreements were in the direction of the other judge 
attributing awareness where the author did not. This situation was 
corrected prior to the scoring of Study II, and as will be reported, 
the reliabilities were much higher in that study. 
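
For readers unfamiliar with the phi over phi-max ratio, a minimal Python sketch follows. It is not taken from the original analysis: the function name and the cell labels `a`, `b`, `c`, `d` are hypothetical, and it assumes the standard definition in which phi-max is the largest positive phi attainable given the two judges' marginal proportions. Both judges are also assumed to classify at least one subject each way, so that no marginal is 0 or 1.

```python
import math


def phi_and_phi_max(a: int, b: int, c: int, d: int):
    """Phi, phi-max and their ratio for a 2x2 agreement table.

    a = both judges score aware      b = only judge 1 scores aware
    c = only judge 2 scores aware    d = both judges score unaware
    """
    n = a + b + c + d
    p1 = (a + b) / n  # proportion scored aware by judge 1
    p2 = (a + c) / n  # proportion scored aware by judge 2
    phi = (a / n - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    # The joint "aware" cell cannot exceed the smaller marginal, which caps phi.
    lo, hi = sorted((p1, p2))
    phi_max = math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))
    return phi, phi_max, phi / phi_max


if __name__ == "__main__":
    # Made-up counts, not data from the study.
    print(phi_and_phi_max(40, 10, 0, 50))
```

In the made-up example one judge's awares are a strict subset of the other's, so phi is below 1 while the ratio of phi over phi-max equals 1. This is why the ratio is a fairer basis than raw phi for comparing measures whose marginal splits differ.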

A more important question concerns the correlation of the aware- 
ness measures with the conditioning behavior. Table 1 also presents 
these phi correlations. The awareness dichotomies used here were 
based on the average of both judges’ ratings. The 9-point rating 
scale (conditioning) was dichotomized by considering a subject in 
the positive condition who scored one or two to show conditioning; 
the others were considered not to have shown conditioning. This fit 
the bimodality of the data. This scoring was reversed in the negative 
condition, and then the data were pooled for the correlations. It may 
be seen that while the Staats questionnaire can be reliably scored, 
TABLE 1 

Reliability and Validity of Three Postexperimental Measures of Contingency and Demand Awareness Expressed as Phi over Phi-max 

                                       Questionnaires 
                      Staats        Insko & Oakes               Page 
                      Contingency   Contingency   Demand        Contingency 
Correlations          awareness     awareness     awareness     awareness 

Scorer                .88/.90       .84/.92       .71/.97       .72/.83 
reliability           (.98)*        (.91)         (.73)         (.87) 

Correlation with      .34/.63       .57/.90       .41/.71       .57/.87 
criterion con-        (.54)*        (.63)         (.57)         (.65) 
ditioning 

[Page demand awareness column not legible in source.] 

* Ratio of phi over phi-max. 


it does not correlate especially high with conditioning. This could 
be, as Staats claims, due to the fact that his open-ended question 
does not suggest as much awareness to genuinely conditioned sub- 
jects. But, in the context of the rest of the data it is more likely that 
the open-ended question is simply not a very valid measure of 
awareness. 

The Insko and Oakes, and the Page measures of contingency 
awareness correlate with conditioning to the same degree, and at 
about the same levels reported in the original studies. A discrepancy 
between the measures of demand awareness also occurs, as would be 
expected, from the correlations reported in the original studies. The 
Page measure correlates much better with conditioning than does 
contingency awareness, while the Insko and Oakes measure does 
not correlate as well as contingency awareness. The dichotomized 
scoring used in this analysis actually makes the Insko and Oakes 
measure a little more valid than the original scoring method. The 
Pearsonian correlation using the full scale scores for both demand 
awareness and conditioning was r = +.34, which is more compara- 
ble to the original study. 

It is possible to attribute this discrepancy between the Page, and 
Insko and Oakes measures to a lack of validity in the Insko and 
Oakes open-ended question approach. Examination of the Ns in 
Table 2 reveals that the Insko and Oakes measure identifies fewer 
subjects as aware. What is not apparent in this table is that the 
discrepancy is even greater than this, because the Insko and Oakes 
measure also identifies several subjects as demand aware who 
were not later demand aware on the more elaborate Page question- 
naire. The subjects aware on the Page measure, but not on the Insko 
and Oakes measure, in general show high conditioning (14 of 16). 
The ones aware on the Insko and Oakes measure, but not on the 
Page measure, in general do not show it (1 of 7). Thus, it is possible 
to suggest that the Insko and Oakes measure misclassifies a number 
of subjects or is of low validity. Therefore, it would not be expected 
to account for all the variance in attitude conditioning, even if, as 
the author has suggested, demand awareness is the crucial variable 
in so-called attitude conditioning. 

It is the usual practice in both verbal operant and classical con- 
ditioning studies to remove aware subjects from the data before 
analysis. Such a procedure implicitly assumes a valid measure of 
awareness. Table 2 shows what happened to the conditioning effect 
when subjects aware by the various measures used in this study are 
removed before analysis. Notice first that the total data showed a 
strong conditioning effect (t = 5.19, p < .001). The Staats question 
identifies 19 of 160 subjects as contingency aware. Four of these 
were contingency, but not demand aware on the Page measures, 
and they did not show conditioning. When the Staats awares are 
removed the means are less discrepant, but the difference is still 
highly significant (t = 3.83, p < .001). This is exactly what would 
be expected if the validity of the measure was not adequate and 
many awares were left in the data. Notice that the Insko and Oakes 
contingency awareness measure identifies 60 of 160 subjects as 
aware, and removal of these from the data leaves the remaining 


TABLE 2 

Mean Conditioning with Awares Removed by Various Measures of Awareness 

Subjects                          Direction of conditioning 

Total data (none removed) 
Staats CA removed 
Insko & Oakes CA removed 
Page CA removed 
Insko & Oakes DA removed 
Page DA removed 

[Cell values illegible in source.] 
data not significant (t = .44). According to the author's theory 
(Page, 1969) of the role of demand awareness in this situation a 
good measure of contingency awareness should do just that, because 
demand awareness requires that a subject first be contingency 
aware. Notice next that the Page measure of contingency awareness 
also renders the remaining data nonsignificant (t = .95) [...] 
very strong support for the demand characteristics position. 
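
The logic of removing awares and retesting the conditioning effect can be made concrete. The sketch below is not the author's computation: the record layout (`condition`, `rating`, and a boolean awareness flag) and the function names are hypothetical, and it assumes an independent-groups t with pooled variance comparing ratings in the positive and negative conditions, which the text does not specify.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Independent-groups t statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))


def t_after_removal(subjects, aware_key):
    """Drop subjects flagged aware under `aware_key`, then compare conditions."""
    kept = [s for s in subjects if not s[aware_key]]
    positive = [s["rating"] for s in kept if s["condition"] == "positive"]
    negative = [s["rating"] for s in kept if s["condition"] == "negative"]
    return pooled_t(positive, negative)


if __name__ == "__main__":
    # Made-up ratings on the 9-point scale; not data from the study.
    demo = (
        [{"condition": "positive", "rating": r, "aware": True} for r in (1, 1, 2)]
        + [{"condition": "negative", "rating": r, "aware": True} for r in (8, 9, 9)]
        + [{"condition": "positive", "rating": r, "aware": False} for r in (4, 5, 6)]
        + [{"condition": "negative", "rating": r, "aware": False} for r in (4, 5, 6)]
    )
    print(t_after_removal(demo, "aware"))
```

In the made-up data the whole effect is carried by the flagged subjects, so t falls to zero once they are removed. A measure that misses some of them would leave a residual difference, which is the pattern reported above for the Staats question.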

As previously suggested, the Insko and Oakes measure of demand 
awareness seems to lack validity. It removes 29 of 160 subjects 
from the data, but the remaining data are still highly significant 
(t = 3.05, p < .01). In fact, some of the subjects (N = 5) identi- 
fied as demand aware by this measure are not even contingency 
aware by any measure; this would be a logical impossibility if the 
measure were valid. On the other hand, the measure leaves a number 
of subjects (N = 14) who are contingency aware and high condi- 
tioners in the data. It is precisely these subjects that are identified 
as demand aware by the Page questionnaire. 

Recall that subjects were asked to rate their attention, effort to 
learn and boredom on 9-point rating scales. Subjects were divided 
into demand aware versus unaware groups on the basis of the Page 
measure. Table 3 presents these data. The aware subjects reported 
more attention (t = 2.72, p < .01), more effort to learn all the 


TABLE 3 

Mean Reported Attention, Effort to Learn and Boredom for Demand Aware (N = [illegible]) versus Unaware (N = [illegible]) Subjects 

Item                                        Aware     Unaware 

[Attention item; wording illegible]         6.92      5.94 
Tried to [learn] all the words              6.77      5.74 
I was not especially bored, to I [...]      [illegible] 
words (t = 2.94, p < .01) and less boredom (t = 2.76, p < .01) 
during the experiment than the unawares. While these data do not 
distinguish between the predictions of conditioning versus demand 
awareness theory, they do suggest that attentional and motivational 
factors are important in distinguishing between subjects who are 
aware and show the conditioning effect and those who do not. 


Study II 


Subjects 


Subjects were 300 introductory psychology students at the Uni- 
versity of Nebraska at Lincoln. They were run in groups varying 
in size from 15 to 30 each. Data were collected slightly past the mid- 
dle of the university semester. 


Method 


The method and procedure was the same as in Study I with the 
following exceptions: (a) since the data were to be used only for 
correlational purposes, all subjects were run in the negative condi- 
tion rather than reversing conditions for half the subjects; (b) sub- 
jects were divided into groups of 100 and each group was given a 
[...] and Page questionnaires in Study [...] in detail the scoring 
criteria for [...] the protocols for subjects where the judges had 
disagreed in Study I were discussed in detail. It was expected [...] 
scorer reliability in this study. The judges then [...] the question- 
naires from Study II independently. 
Results 

The results of this study are presented in Table 4. Comparison of 
this table with Table 1 reveals that the scorer reliability for all 
questionnaires is higher in this study. With the exception of the [...] 
demand awareness measure, these reliabilities are remarkably high, 
demonstrating at least that judges agree more after they 
have practiced and discussed than before. 

Inspection of the pattern of correlations between dichotomized 
awareness measures and the conditioning scale dichotomized at eight 
and nine (high conditioning) versus one through seven (low condi- 
tioning), reveals a pattern similar to that found in Study I. The 
poorest correlate with conditioning is the Staats open-ended ques- 
tion and the strongest is the Page measure of demand awareness. 
Since this was a separate groups design, and since the correlations 
are approximately the same as in the earlier repeated measures 
design, it may be concluded that order of presentation probably did 
not have an important effect on the outcome of Study I. Since this 
replication again finds demand awareness during the experiment as 
measured by the Page questionnaire to be the strongest correlate 
with conditioning, it again points to the crucial importance of this 
variable. 


Discussion 


These studies have brought evidence to bear on the questions 
raised earlier (Page, 1969) concerning the nature of demand aware- 
ness and the appropriate method of assessing it postexperimentally. 
It appears clear now that the Insko and Oakes (1966) approach to 
scoring responses to a single open-ended question is more of a meas- 
ure of clarity of expression in responding to the question as well as 
of subjects’ understanding of what is being asked than it is a meas- 
ure of demand awareness. These variables are correlated but not 
strongly. What is required is a multiple question approach, so that 


TABLE 4 

Reliability and Validity of Three Postexperimental Measures of Contingency and Demand Awareness in Study II Expressed as Phi over Phi-max 

                                       Questionnaires 
                      Staats        Insko & Oakes               Page 
                      Contingency   Contingency   Demand        Contingency 
Correlations          awareness     awareness     awareness     awareness 

Scorer                .76/.76       .92/.96       .78/.9[?]     .89/.98 
reliability           (1.00)*       (.96)         [illegible]   [illegible] 

Correlation with      .26/.61       .52/.84       .58/.88       .69/.87 
criterion con-        (.42)*        (.62)         (.60)         (.67) 
ditioning 

[Final column partly cut off in source; remaining entries illegible.] 

* Ratio of phi over phi-max. 
there is less possibility of a breakdown in communication. The sub- 
ject has to understand what is being asked and he has to write 
enough so that the judge can understand what he meant. 

There are numerous studies in the current literature which either 
conclude that conditioning can occur without awareness because 
unawares by their measure show a significant effect or that condi- 
tioning is a function of awareness because unawares by their meas- 
ure showed no effect. It seems that the resolution of these contra- 
dictory conclusions is basically a measurement problem. Attention 
should not be focused on the contradictory conclusions but on the 
differences in measurement operations. The logic of classifying sub- 
jects as to awareness, removing the awares from the data and then 
drawing conclusions about the remaining data requires accurate 
classification. The present studies suggest that brief open-ended 
measures of awareness are simply not valid, and this can account 
for the current proliferation of contradictory data. An interesting 
fact, often overlooked because attention is focused upon the presence 
or absence of conditioning in subjects classified as unaware, is that 
aware subjects by any measure always show much more condition- 
ing than unawares. Thus, there is the strong possibility that only a 
few awares left in the data because of an invalid measure would 
result in “conditioning without awareness,” while a more valid meas- 
ure of awareness would not. 

More generally, one often encounters deception experiments in 
the contemporary social psychology literature where subjects were 
asked an open-ended question concerning subjects’ knowledge of 
the purpose at the conclusion of the experiment. The authors usually 
indicate that a certain small percentage of subjects saw through 
the deception and that these were eliminated prior to the analysis. 
But, we now ask, how many others also were aware and were not 
detected? What effect might these undetected demand awares have 
on the significance of the data? In the light of the present data, it is 
strongly recommended that we develop appropriate extended post- 
experimental questionnaires and use them whenever conducting a 
deception experiment. Especially this should be done with experi- 
mental situations where in the past a few awares have been detected 
using the single question approach. Whenever a few verbalized 
awareness to an open-ended question there may have been several 
more who would have been detected by a more valid technique. 

It is often warned that the use of extended awareness interviews 
can lead to the suggestion of verbalizations of awareness after the 
fact. The present author is well aware of this possibility; therefore, 
controls for this were built into the Page (1969) questionnaire. It 
should be pointed out, however, that this excuse for not using an 
adequate questionnaire has been overworked, particularly by those 
with some theoretical investment in not discovering that, after all, 
subjects in human experiments do think about the purpose of what 
they are required to do. The author feels that it is difficult to sug- 
gest something as basic as whether the subject knew what was going 
on in the experiment or not, especially with a more elaborate ques- 
tionnaire. With a single open-ended question the subject’s answer 
may be so vague that it may suggest to the scorer that he might 
have been aware when in fact he was not. Rather than suggesting 
awareness, more elaborate questionnaires actually reduce the possi- 
bility of the judge reading into the subject’s reports. 

There is evidence in this study, however, that some subjects can 
reflect back on their previous experience and verbalize the correct 
contingency when, in fact, it was not very salient during the experi- 
ment, at least when there are no controls in the questionnaires to 
prevent this. The Insko and Oakes contingency awareness measure 
is especially vulnerable to this, because subjects are asked to verbal- 
ize the contingency in any way they can, disregarding time or sal- 
iency of awareness. Recall that in Study I the Insko and Oakes 
contingency awareness questionnaire identified 10 more subjects as 
aware than did the Page measure of contingency awareness. The 
Page measure considered both verbalizations of the contingency and 
a rather stringent criterion of when the awareness occurred. These 
10 subjects were indeed ones who on the Page questionnaire said, 
in effect, “Yes, I recall some bad words associated with wuh but 
I'm not really sure this was always the case and I didn't think much 
about it until afterwards.” If, indeed, it is demand awareness dur- 
ing the experiment that mediates attitude conditioning, then sub- 
jects contingency aware after the fact would not be expected to show 
conditioning and, in fact, they do not. A high correlation between 
awareness and behavior depends upon accurate assessment of aware- 
ness. Subjects who are aware but missed by an insensitive assess- 
ment technique as well as subjects unaware but suggested aware by 
a probing technique both serve to reduce correlations. It is here 
recommended that a solution to this problem is to use a more ex- 
tended and probing questionnaire so as to avoid missing aware 
subjects, and to include questions concerning the timing and saliency 
of awareness so as to avoid suggesting reports of awareness in 
unaware subjects. 

There is the remote possibility of suggesting awareness in some 
subjects even with a controlled questionnaire; however, it seems sci- 
entifically more rigorous and defensible to risk suggesting awareness 
to a few subjects who really were not, than to risk not identifying 
subjects who actually were aware if, as is usually the case, one is 
concerned with the extent of the experimental effect in subjects who 
are truly unaware. 


Summary 


Two studies compared three postexperimental techniques for as- 
sessing awareness in attitude conditioning. It was found that multi- 
ple question techniques for assessing both contingency and demand 
awareness resulted in stronger correlations with conditioning than 
did single open-ended question techniques. This was attributed to 
ambiguities inherent in the open-ended technique which lead to 
misclassification of many subjects as to their awareness. The possi- 
bility of suggesting awareness by using extended questionnaires was 
discussed, and it was concluded that questions regarding the time 
and saliency of subjects’ awareness should be included in question- 
naires. 
A PROJECTIVE OCCUPATIONAL ATTITUDES TEST 


LeROY C. OLSEN 
AND 
WILLIAM H. VENEMA 
Washington State University 


OCCUPATIONAL counseling is a personalized process which at- 
tempts to aid the individual in understanding his values as they 
relate to occupational self-actualization. Goodstein (1965) sug- 
gested that an adequate theory of vocational development, choice, 
and adjustment must take into consideration both external reality 
factors and psychodynamics. Forer (1965) suggested that a com- 
prehensive occupational theory must account for processes and 
problems of developing skills, knowledge, efficiency, productivity, 
creativity, attitudes, and interpersonal relationships. 

Means of assessing occupational attitudes are limited in nature. 
Frequently used standardized paper-and-pencil interest tests such 
as the Kuder Preference Record and the Strong Vocational Interest 
Blank may only superficially consider attitudes. Allport (1954) be- 
lieved that such instruments were helpful to a degree, but they 
dealt with the typical variables, and thus provided very little in- 
formation about the unique motivation or the underlying poten- 
tialities of a single case. 

While the results of research to date do not appear to have pro- 
duced significant results in all cases, certain facts seem apparent. 
First, most of the research has been conducted with college students 
or professional groups; second, standard projective devices have 
been used rather than adapting or developing devices utilizing 
occupational themes; and third, projective devices have been used 
primarily for the purpose of studying personality rather than at- 
titudes. 


Purpose of the Study 


The purpose of this study was to attempt to validate and 
standardize a projective device for measuring the attitudes of cul- 
turally disadvantaged youth toward work roles and work environ- 
ments. It was assumed that the following aspects of attitudes could 
be measured by a projective occupational attitude technique: 

1. That the projective technique would provide an index of at- 
titudes towards the following occupational aspects: tasks, 
tools, equipment, working environment, and interpersonal re- 
lationships representative of certain occupations; 

2. That the projective test would provide meaningful attitudinal 
information that could be classified, validated, and would dis- 
tinguish between various groups; and 

3. That the projective test would provide meaningful attitudinal 
information in relation to self-concept, needs, and occupa- 
tional choices. 


[...] achievement, and (d) satisfaction-enjoyment; plans for occupa- 
tional advancement: (a) immediate economic press and (b) de- 
layed need gratification; and anxiety and frustration in (a) meet- 
ing job qualifications and (b) gaining entrance and employment in 
[...] interpersonal relationships with other workers; through identifi- 
[...] supervisory or authority figures; 
[...] (a) tasks, (b) tools, (c) equipment, and (d) working con- 
ditions. 


Procedure 
Sample 


One portion of the sample was obtained from the Columbia Job 
Corps Center located at Moses Lake, Washington. Youths selected 
for the Job Corps included those who: (a) were school dropouts, 
(b) were unable to find or hold jobs or lacked marketable skills or 
expressed vocational goals, (c) had poor school performance, (d) 
were unable to pass the educational part of the Selective Service 
Examination, and (e) had a self-concept of defeat and failure. A 
total of 88 enrollees met the criteria for selection and constituted 
the study sample. Of the 88, a total of 47 or 53.41 per cent were 
Negro and 41 or 46.59 per cent were Caucasian. From each of the 
Negro and Caucasian groups a random sample of 15 subjects was 
selected for the study. 

A secondary school and two junior high schools provided a sam- 
ple of 384 pupils from the Tacoma Public Schools, Tacoma, Wash- 
ington. Subjects were selected on the basis of variables similar to 
those necessary for persons enrolling in the Job Corps. Of this 
total, 91 or 23.35 per cent met the study criteria. From each of the 
Negro and Caucasian groups a random sample of 15 subjects was 
selected for the study. Two groups were selected from the secondary 
school, and two groups from each of the two junior high schools. 
The total sample included 90 subjects. 


Measuring Instrument 


Olsen (1966) developed a projective instrument, the Projective 
Occupational Attitudes Test (P.O.A.T.), designed to assess occupa- 
tional attitudes of non-college bound pupils. The test consists of 
10 pictures mounted on heavy paper portraying five major dimen- 
sions of common nonprofessional level male situa- 
tions, and one blank card. The dimensions portrayed are (a) acts, 
(b) tools and/or equipment, (c) materials, (d) working environ- 
ments, and (e) interpersonal relationships. The test attempts to 
provide a measure of occupational attitudes of noncollege bound 
individuals in the following occupational areas: Distribution, Car- 
pentry, Electrical, Store Clerk, Farm Work, Heavy Construction, 
Forestry, Traffic, Janitorial Work, Service Station, and a Blank 
Card. It should be recognized that this study was exploratory and 
represented an original instrument and the initial research with 
this instrument. Complete scoring details are contained in the 
original report. 
Statistical Treatment of Data 


Intelligence test standard deviation scores, reading grade level 
scores, family stability, occupational models, and occupational 
status of subjects’ parents were used for classifying subjects into 
various categories. The Fisher Exact Probability Test and the 
Chi Square test were applied to determine the significance of 
proportions of subjects falling into various categories between 
groups. In situations appropriate for testing the level of signifi- 
cance of difference between two means the t test was applied. 
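
As an illustration of the Fisher Exact Probability Test applied to two groups of 15, a minimal Python sketch follows. It is not the study's computation: the function name and the counts are hypothetical, and the two-sided p value is taken as the sum of all table probabilities no larger than that of the observed table, which is one common convention; the original report may have used a one-tailed form.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are the two response categories.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so tables exactly as probable as the observed one count.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


if __name__ == "__main__":
    # Made-up counts: 12 of 15 adequate responses in one group, 5 of 15 in another.
    print(fisher_exact_two_sided(12, 3, 5, 10))
```

With cell counts this small the chi-square approximation is unreliable, which is presumably why both tests were used.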


Results and Discussion 


Seven variables were considered under total response: percep- 
tion and response, movement in story, response to the total card 
situation as compared to some aspect, a simple description as com- 
pared to an organized story, acceptance as compared to rejection 
of the stimulus card, projection—degree and kind, and indications 
of liking or disliking the work pictured on the cards. 

There were no significant differences between groups in propor- 
tions of subjects falling into the various categories on the following 
variables related to total response: movement in story, acceptance 
as compared to rejection of the stimulus card, projection—degree 
and kind, and liking compared to disliking the portrayed work. 

Senior high Negroes gave significantly more adequate responses 
than senior high Caucasians, .01 level; junior high Negroes, .01 
level; and junior high Caucasians, .05 level. Total Job Corps 
enrollees gave significantly more adequate responses than total 
junior high pupils, .05 level. Total senior high pupils gave sig- 
nificantly more adequate responses than total junior high pupils, 
.05 level. 

Senior high Negroes responded significantly more frequently to 
the total card situation than senior high Caucasians, .01 level; Job 
Corps Caucasians, .05 level; and junior high Caucasians, .05 level. 
Total senior high Caucasians responded significantly more fre- 
quently to the total card situation than total junior high school 
pupils, .02 level. 

The majority of subjects’ total responses, with the exception of 
senior high Negroes and total senior high pupils, were classified as 
minimal or inadequate in perception and response to the stimulus 
cards. Most subjects gave simple descriptions rather than organized 
responses. Movement was usually included in these responses. The 
subjects appeared to view being employed in any occupation as 
being more desirable than being unemployed or identifying with a 
prestigious occupation. 

The results for physical aspects of work indicated a low level of 
concern about tools and equipment. The subjects responded more 
frequently to tasks and interpersonal relationships than tools and 
equipment. 

Only two of the 19 comparisons between groups on attitudes 
toward equipment were significant. Senior high Negroes were sig- 
nificantly more concerned with equipment than senior high Cau- 
casians, .02 level, and junior high Negroes, .05 level. Those sub- 
jects who did respond to tools and equipment were usually unable 
to accurately describe the appropriate uses of the tools and equip- 
ment. 

The subjects were more concerned with tasks than the other three 
variables. Only one of the comparisons between groups for tasks 
was significant. Junior high Caucasians were significantly more 
concerned with tasks, .05 level, than senior high Caucasians. 

Senior high Negroes were significantly more concerned with work 
environment than junior high Negroes, .02 level, and junior high 
Caucasians, .05 level. Total Job Corps enrollees were significantly 
more concerned with work environment, .05 level, than total junior 
high pupils. Total senior high pupils were significantly more con- 
cerned with work environment, .05 level, than total junior high 
pupils. 

The aspects of work environment were related to the placement 
of figures in identifiable surroundings. Senior high Negroes pre- 
ferred occupations performed outside, while other groups did not 
indicate a definite preference for occupations performed either
outside or inside. Junior high Negroes were less concerned with work
environment than other groups.

In relation to other variables, security was the chief concern of
all groups. Senior high Caucasians were more concerned with
security, .05 level, than Job Corps Negroes. Total junior high pupils
were significantly more concerned with security, .05 level, than total
Job Corps enrollees. Total senior high pupils were significantly
more concerned with security, .05 level, than total junior high
pupils.

The need for satisfaction and enjoyment in work was ranked
second by all groups. The satisfaction and enjoyment variable was
concerned with a sense of pleasure or gratification obtained from
work. The results indicated that senior high Negroes were
significantly more concerned, .01 level, than Job Corps Negroes. Senior
high Caucasians were significantly more concerned, .05 level, than
Job Corps Negroes. Total senior high pupils were significantly more
concerned, .02 level, than total junior high pupils. Job Corps
Negroes were less concerned about satisfaction and enjoyment in
work than other groups. Total senior high pupils tended to identify
at higher occupational levels than other groups.

The achievement variable tended to differentiate between groups
more frequently than any of the other POAT variables. The achievement
variable was concerned with such factors as going to school, getting
a job, work experience, and occupational advancement. Ten of the
19 comparisons between groups were significant. The results
indicated that senior high Negroes were significantly more
concerned with achievement than Job Corps Negroes, Job Corps
Caucasians, and junior high Negroes, .01 level; and junior high
Caucasians, .05 level. Senior high Caucasians were significantly more
concerned with achievement than Job Corps Negroes and Job
Corps enrollees, .05 level. Total senior high pupils were significantly
more concerned, .01 level, than total junior high pupils. Total
junior high pupils were significantly more concerned with
achievement, .05 level, than total Job Corps enrollees. Job Corps Negroes,
Job Corps Caucasians, junior high Negroes, and junior high
Caucasians were least concerned with achievement. There were no
significant differences between groups in the proportions of subjects
falling in the at-or-above-the-mean category, as compared to the
below-the-mean category, for security-money. The need for money
or the concern over money was ranked fourth by the total group.

Senior high Negroes were significantly more concerned with
prestige than junior high Caucasians, .01 level; junior high Negroes,
.01 level; and senior high Caucasians, .05 level. Total senior high
pupils were significantly more concerned with prestige than total
junior high pupils, .01 level, and total Job Corps enrollees, 0

Senior high Negroes were significantly more concerned with
interpersonal relationships, .02 level, than junior high Negroes. Total
senior high pupils were significantly more concerned with
interpersonal relationships, .02 level, than total junior high pupils.

Junior high Negroes were less concerned over interpersonal rela- 
tionships than other groups. They also tended to identify at lower 
occupational levels. There were no significant differences between 
groups for dependency. In general, there tended to be a low concern
over dependency by all groups. 

The results for the POAT primary occupational identification 
levels tended to reflect the earlier results of the subjects’ selections 
of the POAT Blank Card occupational identification levels. Senior
high Negroes identified at significantly higher occupational levels
than Job Corps Negroes, .01 level; junior high Negroes, .01 level;
Job Corps Caucasians, .05 level; and junior high Caucasians, .05
level. Senior high Caucasians identified at significantly higher
occupational levels, .02 level, than junior high Negroes. Total senior
high pupils identified at significantly higher occupational levels than
total Job Corps enrollees, .01 level, and total junior high pupils,
.01 level.

The results were similar to those found in subjects’ selections of 
Blank Card primary occupational identification levels. The results
also indicated that senior high Negroes aspire to or identify with
occupations approximating the semiprofessional levels (Roe).
Junior high Negroes identified at lower levels than all other groups. 

There were no significant differences between groups for
acceptance or rejection of the primary identification figures. As a
total group the majority of subjects tended to accept the primary
occupational identification figures. There were no significant
differences between groups for favorable authority relationships as
compared to unfavorable authority relationships. Junior high [...]
tended to be slightly more resentful of authority figures than other
groups.

Senior high Negroes utilized defense mechanisms in their
responses significantly more frequently than Job Corps Negroes,
Job Corps Caucasians, junior high Negroes, junior high
Caucasians, and senior high Caucasians, .01 level. Total senior high
pupils utilized defense mechanisms significantly more frequently
than total junior high pupils.

There was a lack of parental influence in the subjects’ perceptions
and attitudes toward work. There were no significant differences
between groups concerning attitudes and feelings toward parents.
Inspection of the data revealed that only 96 responses relating to
parents were made by the 90 subjects.

No significant differences were found between groups for the two 
variables relating to long range occupational goals. The importance 
of these results is evident because each subject was specifically 
instructed to respond by indicating what the persons portrayed 
in the stimulus cards felt the future held for them. It seems that 
the press for security and immediate need gratification fosters at- 
titudes which are probably related to immediate employment and 
need gratification, rather than to long range occupational planning 
or interest. 

One significant difference was found in the comparisons involving
occupational advancement. Job Corps Caucasians indicated more
frequently that the portrayed figures on the cards would move
upward in occupations than Job Corps Negroes; this was significant
at the .02 level. The majority of the subjects felt that the portrayed
figures would be static or remain at their present occupational
level.

The results indicated no significant differences between groups
for concern over ability and aptitude in meeting job qualifications
in need-satisfying occupations. The results suggested a limited
perception by subjects of their aptitudes and ability as these
variables relate to employment.

Senior high Caucasians were significantly more concerned over 
possible denial of training, skill, and education, .05 level, than 
junior high Negroes. Total senior high pupils were more concerned 
with possible denial of training, skill, and education, .001 level, 
than total junior high pupils. 

As a total, the subjects responded more frequently to possible
denial of training, skill, and education than to any of the variables
related to anxiety and frustration in meeting job qualifications; the
one exception was junior high Negroes. Junior high Negroes
appeared to be least concerned with anxiety and frustration in
meeting job qualifications. Significant differences, at the .05 level, were
found with junior high Caucasians being more concerned over
portrayed figures lacking training, skill, and education than senior
high Negroes. Job Corps Negroes were significantly more concerned
over combined training, skill, and education and lack of training,
skill, and education, .05 level, than junior high Negroes.

Junior high Caucasians were significantly more concerned over 
the future, .05 level, than senior high Negroes and senior high 
Caucasians. There were no significant differences between groups 
for concern over possible denial of prestige.

Senior high Negroes were significantly more concerned over
possible denial of achievement than Job Corps Negroes, .01 level; Job
Corps Caucasians, .01 level; junior high Negroes, .02 level; and
junior high Caucasians, .05 level. Senior high Caucasians were
significantly more concerned than Job Corps Negroes, .01 level; junior
high Caucasians, .01 level; and junior high Negroes, .02 level.
Total senior high pupils were significantly more concerned with
possible denial of achievement than total junior high pupils, .001
level.

There were no significant differences between groups in propor- 
tions of subjects falling into the delayed as compared to the im- 
mediate need gratification categories. The results suggested that 
concern for security acts as a major deterrent to delayed need 
gratification. 

The subjects were asked to rank the stimulus cards in the order 
they liked best. It was found that the Blank Card was ranked 
first. The results suggested that the responses to the Blank Card 
were probably a reliable indicator of the subjects’ occupational
identification level. The wide variety of responses to the cards, 
though descriptive in nature, indicated that the subjects were re- 
sponding to their individual perceptions and needs. 

The results suggested that age and educational level were
contributing factors in the differences found between group means for
response time, cue time, and total response time. Significant
differences at the .01 level were found between the following groups:
junior high Caucasians used less time than Job Corps [...],
junior high Caucasians used less time than senior high Caucasians,
and total junior high pupils used less time than total Job Corps
enrollees.

Cue time was defined as the time between the presentation of the
stimulus card and the subject's first response to the card.
Significant differences at the .001 level were found between the
following groups: junior high Negroes required more time to respond
than Job Corps Negroes and Job Corps Caucasians; junior high
Caucasians required more time to respond than Job Corps
Caucasians; junior high Negroes required more time than senior high
Negroes and senior high Caucasians; total junior high pupils
required more time to respond than total Job Corps enrollees; and
total junior high pupils required more time to respond than total
senior high pupils.

Total response time was the length of time taken to complete the
total test: a combination of response time and cue time. Significant
differences at the .01 level were found between the following groups,
with Job Corps Caucasians requiring more time than junior high
Negroes and junior high Caucasians, and with total Job Corps
enrollees requiring more time than total junior high pupils.
Significant differences at the .05 level were found with Job Corps
Negroes requiring more time than junior high Negroes, and with
senior high Negroes requiring more time than junior high
Caucasians. Differences between groups on total response time appeared
to be related to age and work experience.

The results of this study must be viewed as tentative in nature
due to the exploratory aspects of the POAT, a lack of data from
other groups for purposes of comparison, the small size of the
sample, and a lack of similar instruments which could be used for
purposes of validation. However, the POAT seemed to provide a
method of studying and analyzing the attitudes and perceptions of
culturally disadvantaged youth towards work.
THE VALIDITY OF MEASURES
OF EYE-CONTACT


MARVIN E. SHAW, J. THOMAS BOWMAN, AND
FRANCES M. HAEMMERLIE


University of Florida 


The significance of mutual eye-contact for interpersonal behavior
has been noted by a number of writers (e.g., Simmel, 1921; Heider,
1958), but it was not until Exline's (1963) study that this
phenomenon began to be examined under controlled conditions. Since
that time, numerous experimental studies have been reported
concerning the relationship of eye-contact to other aspects of the
interpersonal situation (Exline, Gray, and Schuette, 1965; Argyle
and Dean, 1965; Efran, 1968; Efran and Broughton, 1966). In the
typical experiment, two persons engage in conversation during
which the frequency and duration of eye-contact are recorded by
observers located behind a one-way vision screen. Observers record
eye-contact by means of push-buttons which operate pens on an
event recorder. It is, therefore, possible to compare the
scores of two or more observers, thus obtaining evidence of the
reliability of such measures. The evidence suggests that good
inter-observer reliability can be achieved; for example, Exline (1963)
reported inter-observer reliabilities of .97 to .98.

Curiously enough, little attention has been given to the problem
of the validity of such measures; i.e., the degree to which such
measures reflect actual eye-contact. Exline (1963) cited an unpublished
study by Gibson and Davidson (later published by Gibson and
Pick, 1963) as evidence that two individuals can judge accurately
when they are looking into each other's eyes. This same study
was cited by Argyle and Dean (1965). Unfortunately, examination
of the report by Gibson and Pick (1963) reveals that this study did
not deal with eye-contact at all. Instead, Gibson and Pick were
interested in line of gaze; they demonstrated that subjects can
judge accurately whether another person is fixating above, below,
or on the bridge of the nose.

There appears to be only one other study that referred to the
validity of such observations. Exline, Gray, and Schuette (1965)
reported a pretest in which an interviewer maintained constant
fixation on the subject’s eyes, while both he (the interviewer) and
an observer recorded frequency and duration of the subject's visual
fixations on the interviewer. Since the interviewer was always
available for eye-contact, one may assume that mutual eye-contact
occurred each time the subject fixated on the interviewer's eyes.
Exline et al. reported that the observer and interviewer agreed on
88 per cent to 98 per cent of their judgments. They gave no details
concerning distance between interviewer and subject, distance
between observer and subject, number of subjects, or how agreement
was computed.

The purpose of the present study was to provide empirical
evidence concerning some of the implicit assumptions about
eye-contact and its measurement that are inherent in the kinds of
experiments cited above, namely, (a) that two individuals can judge
accurately the frequency and duration of mutual eye-contact, and
(b) that an observer can validly judge when two other persons are
looking into each other's eyes. Since distance between the members
of the dyad has been shown to be an important determinant of
eye-contact as usually measured (Argyle and Dean, 1965), data
were collected at several inter-person distances.


Study 1 
Method 


The first part of this investigation attempted to determine whether
two persons could judge accurately when they were looking into
each other's eyes. Two persons, one male and one female, who had
previously served as observers in an eye-contact study, engaged in
two ten-minute discussions at each of four inter-person distances:
2½ ft., 4½ ft., 8 ft., and 12 ft. Each person had a concealed
push-button which was connected to an Esterline Angus Event Recorder
in an adjacent room. During the discussion, each person
pushed his button each time he believed that he had made
eye-contact with the other person and released it when he believed
eye-contact was broken. Since neither knew when the other had
pushed his button, each person's observation was independent of
the other person’s observation, except that both depended upon
mutual eye-contact. Agreement between these two judgments
presumably reflects actual eye-contact.


Results and Discussion 


Each discussion period was divided into one-minute units and
frequency and duration scores computed for each unit. The
correlations between the two sets of frequency scores (number of
eye-contacts per minute) in each session are shown in Table 1 and
correlations between duration scores (amount of eye-contact per
minute) are given in Table 2. As can be seen, correlations ranged
from .59 to 1.00, with average correlations (computed by z
transformation procedures) ranging from .85 to .98. Agreement
regarding duration of eye-contact was significantly less at 2½ ft. than at
8 ft. (p < .03) or at 12 ft. (p < .05). This effect was due entirely to
the low correlation obtained in the first session. This low
agreement could have been caused by lack of adjustment to the
recording procedure, or to the fact that such close proximity was relatively
uncomfortable, a fact noted by the participants.


TABLE 1
Correlations between Frequency of Eye-Contact Reported by Members of a Dyad
as a Function of Distance

              Distance between Members of the Dyad
              2½ ft.    4½ ft.    8 ft.     12 ft.
Session 1      ...       .97      1.00       ...
Session 2      ...       .80       .87       .92
Mean           .91       .92       .97       ...

(Entries shown as ... are illegible in the source.)


TABLE 2
Correlations between Duration of Eye-Contact Reported by Members of a Dyad
as a Function of Distance

              Distance between Members of the Dyad
              2½ ft.    4½ ft.    8 ft.     12 ft.

(Entries are illegible in the source; only two Session 1 values of .98
can be read, and their columns cannot be determined.)
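
The scoring described above (one-minute units, a frequency and a duration score per unit, then a correlation between the two judges) might be sketched as follows. This is only an illustration under stated assumptions: the press and release times are hypothetical, the function names are invented here, and the rule that a contact is counted in the minute in which the button was pressed is an assumption, since the article does not say how contacts spanning two units were handled.

```python
from math import sqrt


def per_minute_scores(intervals, n_minutes=10):
    """Score one judge's push-button record into one-minute units.

    intervals: list of (press, release) times in seconds from session start.
    Returns (frequency, duration): contacts begun in each minute, and
    seconds of reported contact falling in each minute.
    """
    frequency = [0] * n_minutes
    duration = [0.0] * n_minutes
    for press, release in intervals:
        # Assumption: a contact belongs to the minute in which it began.
        start_minute = int(press // 60)
        if 0 <= start_minute < n_minutes:
            frequency[start_minute] += 1
        # Duration is shared among every minute the interval overlaps.
        for minute in range(n_minutes):
            lo, hi = 60 * minute, 60 * (minute + 1)
            overlap = min(release, hi) - max(press, lo)
            if overlap > 0:
                duration[minute] += overlap
    return frequency, duration


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical three-minute records for the two members of a dyad.
person_a = [(10, 20), (50, 70), (130, 140)]
person_b = [(11, 19), (52, 71), (131, 150)]
freq_a, dur_a = per_minute_scores(person_a, n_minutes=3)
freq_b, dur_b = per_minute_scores(person_b, n_minutes=3)
duration_agreement = pearson_r(dur_a, dur_b)
```

With ten one-minute units per session, each correlation of this kind rests on only ten pairs of scores, which may help explain why single-session values vary as much as they do.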

In general, however, the correlations between judgments of
participants were as high as could reasonably be expected, assuming
some error of judgment due to momentary distraction or
involvement in the discussion topic. Hence, we believe that the evidence
strongly supports the common belief that two persons know when
they are looking into each other’s eyes.
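
The average correlations in this study are said to have been computed by z transformation. A minimal sketch of that kind of averaging is given below; the function name and the input values are hypothetical, and the clipping of a correlation of 1.00 (whose z value is infinite) is an assumption made here, since the article does not say how such a value was treated.

```python
from math import atanh, tanh


def mean_r_fisher(correlations, clip=0.999):
    """Average correlations through Fisher's z transformation.

    Each r is converted to z = atanh(r), the z values are averaged, and the
    mean z is converted back to r. Values of +/-1.00 are clipped first
    (an assumption) because atanh(1.0) is undefined.
    """
    zs = [atanh(max(-clip, min(clip, r))) for r in correlations]
    return tanh(sum(zs) / len(zs))


# Two hypothetical session correlations.
average = mean_r_fisher([0.80, 0.97])
print(round(average, 2))  # about 0.92, higher than the arithmetic mean of 0.885
```

Averaging in the z metric gives more weight to high correlations than a simple arithmetic mean of the r values would.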



Study 2 
Method 


The second part of this investigation was designed to
determine the degree of correspondence between judgments of mutual
eye-contact made by a participant and judgments made by an
observer via a one-way vision screen. High correlations would
provide strong evidence that scores obtained by means of observers
do indeed validly measure mutual eye-contact. Since such
measures are often obtained at varying observer-participant distances,
data were collected at three such distances: 4½ ft., 8 ft., and 12 ft.
The 2½ ft. distance was omitted because of the impracticality of
trying to get the participant-subject that close to an observer
located on the opposite side of the one-way vision screen.

In this study, the two persons who had participated in Study
1 served as interviewer and observer. Each person played each role
an equal number of times. As interviewer, each person was paired
with a naive member of the opposite sex. The interview situation
was arranged so that the interviewer sat with his back to the
one-way screen; the subject faced both the screen and the observer
located behind it. The interviewer engaged the subject in a
discussion of any topic of mutual interest for a ten-minute period, during
which time both the interviewer and the observer recorded
eye-contact via push-buttons and the event recorder. The interviewer
attempted to make eye-contact available to the subject at all times,
without appearing to be staring at him. Six interviews were
conducted at each of the three subject-observer distances, for a total
of 18 interviews. The distance between the interviewer and the
subject was approximately 4½ ft. under all conditions.

Each interview was again divided into one-minute units, and
frequency and duration scores computed for each unit. Correlations
were computed between the scores of the interviewer and those of
the observer. Table 3 gives the correlations for frequency scores
and Table 4 presents the correlations for duration of eye-contact.


TABLE 3
Correlation between Frequency of Eye-Contact Recorded by an Observer and Reported
by a Member of the Dyad

                   Distance between Subject and Observer
                   4½ ft.     8 ft.     12 ft.
Male Subjects       .97        .78       ...
                    .85        .87       ...
                    .64        .88       .55
Female Subjects     .66        .58       ...
                    .98        .93       ...
                    ...        .79       .80
Mean                .90        ...       ...


TABLE 4
Correlation between Duration of Eye-Contact Recorded by an Observer and That
Reported by a Member of the Dyad

                   Distance between Subject and Observer
                   4½ ft.     8 ft.     12 ft.
Male Subjects       .92        .68       ...
                    .96        .97       ...
                    .70        .82       ...
Female Subjects     .98        .67       ...
                    .75        ...       ...
                    .95        ...       ...
Mean                .92        .84       ...

(Entries shown as ... are illegible in the source.)


Correlations were noticeably lower here than in the first study,
ranging from .55 to .98 for frequency scores and from .56 to .98 for
duration scores. Variations among sessions at the same distance
are substantial, and the average correlations are noticeably smaller
at the greater observer-subject distances. Correlations between
participant and observer duration scores were significantly higher
(p < .05) at 4½ ft. than at either 8 ft. or 12 ft.; correlations for
frequency scores were significantly greater at 4½ ft. than at 12 ft.
(p < .05), and differences between correlations at 4½ ft. and 8 ft.
approached significance (p < .08).
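
The article does not state which test was used to compare correlations across distances. One common way to compare two correlations from independent samples is the normal test on the difference between their Fisher z values; the sketch below shows that test only as an illustration. The function name, the correlations, and the sample sizes are all hypothetical and are not taken from the tables.

```python
from math import atanh, erfc, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Two-sided normal test for the difference between two independent
    correlations, using Fisher's z transformation.

    Returns (z_statistic, p_value). Each sample must have more than
    three pairs of scores.
    """
    z1, z2 = atanh(r1), atanh(r2)
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_statistic = (z1 - z2) / standard_error
    # Two-sided tail area of the standard normal distribution.
    p_value = erfc(abs(z_statistic) / sqrt(2.0))
    return z_statistic, p_value


# Hypothetical example: 60 one-minute units at each of two distances.
z_stat, p = compare_correlations(0.92, 60, 0.75, 60)
```

Because the test depends on the number of score pairs behind each correlation, the same difference between two r values can be significant in one design and not in another.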

One plausible hypothesis to account for the variability from
session to session might be that validity of judgments increases with
practice; however, the rank order correlation between session order
and observer-interviewer correlations was not significant for either
duration or frequency scores. It might also be suspected that the
subject sometimes looked into the eyes of the interviewer when
the interviewer was looking elsewhere. If so, the interviewer would
not record a contact whereas the observer would. This could
account for both variability and the relatively low
interviewer-observer correlations. Two bits of evidence militate against this
interpretation. First, the interviewer attempted to provide
constant eye-contact availability. Second, the mean frequency and
duration scores of the interviewer were almost identical to those of
the observer, whereas the above interpretation requires that mean
observer scores be higher than mean interviewer scores.
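
The practice hypothesis above was checked with a rank order correlation between session order and the observer-interviewer correlations. A minimal sketch of such a check, using Spearman's coefficient computed as a product-moment correlation between ranks, is given below. The session correlations are hypothetical and the function names are invented for illustration; the article does not say which rank order coefficient was used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank order correlation (Pearson r between the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical: order of six sessions and the correlation found in each.
session_order = [1, 2, 3, 4, 5, 6]
session_r = [0.70, 0.92, 0.61, 0.88, 0.95, 0.66]
rho = spearman_rho(session_order, session_r)
```

A coefficient near zero, as in this made-up example, would give no support to the idea that agreement improves with practice.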

Therefore, the variability appears to be due to uncontrolled
variations in the situation from session to session. Since the
procedures followed were made as uniform as possible, it seems
probable that similar uncontrolled variations occur in other studies of
eye-contact.

The easy explanation of decreased validity of judgments with
increased observer-subject distance is that judgments become more
difficult with distance. Unfortunately, there was no good means
for testing this hypothesis.


Implications for Research 


There are three main conclusions that can be drawn from this
research:

1. An individual can judge accurately when he is looking into
another’s eyes. This conclusion is inferred from the finding that
members of a dyad show high agreement concerning mutual
eye-contact.

2. The validity of observer's judgments of eye-contact varies
considerably from session to session, where validity is inferred
from agreement between observer and participant.

3. The validity of observer's judgments of eye-contact varies
with the distance between the observer and the subject.

The first of these merely verifies a common belief about an
individual's ability to know when he is making eye-contact with
another person, and provides a basis for evaluating the validity 
of the observer's judgments of eye-contact. The second two, how- 
ever, have important implications for most research using eye- 
contact as a dependent variable. Since the validity of observer's 
judgments varies from time to time, it is incumbent upon the in- 
vestigator to demonstrate that the particular measures that he 
uses are reliable and valid. Even if this variability is limited to the 
two particular observers that were used in this study, this means 
that not all trained observers are equally accurate in their judg- 
ments of eye-contact. Hence, each investigator must demonstrate 
that his observer(s) can make valid judgments under the conditions 
of his investigation. It is not sufficient to cite the results of previous 
studies using different observers. 

The consequences of differential validity at different observer- 
subject distances apply primarily to studies using distance as an 
independent variable. The fact that measures of eye-contact are 
not equally valid for all such distances must be taken into account 
in interpreting the findings of such investigations. 
COMBINING THE IPSATIVE AND NORMATIVE
APPROACHES IN SELECTION VALIDATION 


MARGARET A. HOWELL 
Department of Health, Education, and Welfare 


The traditional model of selection validation is that of a predictor
validated against a criterion, a measure of job performance.
Elaborations of the model include multiple correlation involving more
than a single predictor and canonical correlation which deals with
multiple criteria as well as multiple predictors.

Basic to the classical validation model is the concept of reliability
of measurement in which it has been assumed that an obtained score
is composed of the true score plus random error variance (Guilford,
1954, p. 349). Increasing evidence, such as research on moderator
variables, suggests that the error variance, rather than being random,
is systematic and that errors of prediction are predictable (Ghiselli,
1963). A moderator, though based on recognition of nonrandom error
variance, sorts individuals into subgroups to which the classical
validation model is then applied. Subgroup sorting involves assigning
individuals to “types” for selection purposes.

Rather than the use of a variable based on normative
measurement, the ipsative approach has become associated with
typologies (Block, 1961, p. 16). Ipsative measurement is allied with
the idiographic rather than the nomothetic view of psychology
(Allport, 1961, p. 8). The Q sort, based on descriptive statements, is like
an ipsative item of the forced-choice kind in that the focus is on the
way in which traits are organized intra-individually. The Q
technique, involving correlations among individuals, is the correlational
counterpart of the ipsative item. In the application of the Q
technique, the correlations among pairs of individuals represent a
restandardizing of the standard scores derived from the normative
approach to reflect score deviations around individual means of [...].
Factor-analysis of a Q correlational matrix allows the identification
of types in that individuals of the same factor type display a similar
intra-individual pattern of traits even though the level of the traits,
in a normative sense, may differ.
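
The distinction drawn above, between the pattern of traits within a person and the level of those traits in a normative sense, can be illustrated with a small sketch. It assumes normative standard scores on a few traits for two hypothetical persons; the function names are invented here. Each profile is re-expressed as deviations around the person's own mean, and the correlation between two persons across traits serves as the Q correlation.

```python
from math import sqrt


def ipsatize(profile):
    """Express a person's scores as deviations around that person's own mean."""
    own_mean = sum(profile) / len(profile)
    return [score - own_mean for score in profile]


def q_correlation(person_a, person_b):
    """Correlation between two persons computed across traits."""
    a, b = ipsatize(person_a), ipsatize(person_b)
    sab = sum(x * y for x, y in zip(a, b))
    saa = sum(x * x for x in a)
    sbb = sum(y * y for y in b)
    return sab / sqrt(saa * sbb)


# Hypothetical normative standard scores on three traits.
high_scorer = [0.5, 1.0, 2.0]    # generally above the group mean
low_scorer = [-1.5, -1.0, 0.0]   # generally below it, same ordering of traits

same_pattern = q_correlation(high_scorer, low_scorer)
```

Here the two persons differ by a constant two standard-score units on every trait, so their ipsatized profiles are identical and the Q correlation is at its maximum even though their normative levels differ.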

One author has warned against confounding ipsative and
normative measurement (Broverman, 1962, p. 295). It could be argued,
however, that the only meaningful intra-individual variability, other
than a mere ordering of traits, must first be expressed in terms of
normative measurement. Even with ipsative items, this would
suggest the items might be scaled normatively prior to their
ipsative use.

Proposed combined model—For selection purposes, perhaps the
“saliency” of traits within the individual as well as inter-individual
variability needs recognition. This would mean that the classical
selection model might be replaced by one combining the ipsative
and normative approaches. The concern in validation would be
predicting either for the individual or for the type and not “on the
average” as is represented by the traditional validity coefficient.
Instead of a regression model in which the same weights are applied to
all individuals, weights would vary by typology. Q analysis of
multiple criteria affords a means of arriving at the subgroups or types
for which different weights or even different predictor variables
might be used for validation purposes. Validity could be determined
in terms of profile similarity between N predictor and N criterion
measures for an individual, although there are technical difficulties
in the use of measures of profile similarity (Guion, 1965, p. 174).
For each subgroup, an average index of profile similarity between
predictor and criterion profiles could be obtained.
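
The average index of profile similarity proposed above might be sketched as follows. The article does not specify an index, so the correlation between an individual's predictor profile and criterion profile is used here purely as an assumption; it reflects similarity of shape only and ignores level and scatter. The type labels, profiles, and function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def profile_similarity(predictor, criterion):
    """Correlation across N paired predictor and criterion measures
    for one individual (an assumed index of profile similarity)."""
    n = len(predictor)
    mp, mc = sum(predictor) / n, sum(criterion) / n
    spc = sum((p - mp) * (c - mc) for p, c in zip(predictor, criterion))
    spp = sum((p - mp) ** 2 for p in predictor)
    scc = sum((c - mc) ** 2 for c in criterion)
    return spc / sqrt(spp * scc)


def mean_similarity_by_type(records):
    """records: iterable of (type_label, predictor_profile, criterion_profile).
    Returns the mean similarity index for each type."""
    by_type = defaultdict(list)
    for label, predictor, criterion in records:
        by_type[label].append(profile_similarity(predictor, criterion))
    return {label: sum(vals) / len(vals) for label, vals in by_type.items()}


# Hypothetical individuals sorted into two criterion types.
records = [
    ("type A", [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ("type A", [0.0, 1.0, 0.5], [1.0, 3.0, 2.0]),
    ("type B", [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
]
validity_by_type = mean_similarity_by_type(records)
```

A high mean for a type would indicate that, for individuals of that type, the predictor profile tends to mirror the criterion profile; a low or negative mean would suggest that different predictors or weights are needed for that subgroup.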

If the combined ipsative and normative selection model tested out
adequately in both concurrent and predictive validation studies, its
use in practice would be similar to administering a
selection-placement battery. On a new group of applicants, scores on all
predictors would be obtained. Various profiles relevant to the identified
criterion types would be developed on each applicant. The profile
yielding the best match with a particular subgroup would indicate
appropriate placement for the individual. If, for example, Q
analysis of criterion measures identified a subgroup of low speed and
high accuracy typists who were also identifiable from a subset of
predictor variables, an applicant fitting this type could be placed in
a position requiring this intra-individual organization of traits.

Implications of the model—This model for selection validation
assumes individuals are unique, genetically and behaviorally, but
can be subgrouped into typologies, where a type has meaning only
to the extent it is useful for measurement purposes. With sufficiently
refined and comprehensive measurement, the subgroup N would
equal 1!
VALIDITY AND LIKABILITY RATINGS FOR THREE
SCORING INSTRUCTIONS FOR A MULTIPLE-CHOICE 
VOCABULARY TEST 


CARRIE WHERRY WATERS 
Center for Psychological Services 
Ohio University 


LAWRENCE K. WATERS 
Ohio University 


Typical instructions for discouraging examinees from guessing on
multiple-choice tests indicate that some fractional amount will be
subtracted for each wrong answer. The intent is to encourage
examinees to omit items they do not know. Recently, Traub,
Hambleton, and Singh (1969) suggested the addition of some fractional
number of points for omitted items as a more direct and effective
method of eliciting the desired test-taking behavior. They presented
evidence of higher inter-form reliability, and an increase in the
number of omitted items, for the “reward for omits” instructions as
compared to the “penalty for wrongs” instructions.

It also seems reasonable that more examinees would prefer the
direct positive reinforcement of omissive behavior approach to the
less direct negative reinforcement approach. The purpose of the
present study was to compare these two approaches and a third
“rights only” scoring method in terms of (a) how well each method is
liked by examinees and (b) the correlations of scores obtained under
each method with test and course performance criteria.

Method— Subjects. A total of 72 students (48 females and 24
males) from two educational psychology classes participated in the
study during a regular class period. The subjects were divided into
three groups of 24 examinees each for the purposes of the study.

Tests. Three matched vocabulary tests of 15 items each were
constructed from an item pool for which p-values were available. The
corresponding items in the three tests were matched on difficulty to
within ± .05. The item difficulty distributions were skewed with
more items in the higher difficulty range. For each item, the
examinee was required to choose one of the five alternative words which
was most nearly opposite in meaning to the stem word.

Instructions. All examinees were given the same instructions for
answering the items. In addition, three sets of instructions
concerning scoring of the tests were given. The scoring procedures (in terms
of the weights for rights, wrongs, and omits respectively) were
1, 0, 0; 1, −1/4, 0; and 1, 0, 1/4. These scoring procedures were
chosen to represent the no penalty for guessing condition, the
conventional correction procedure for random guessing on five-alternative
items, and the procedure suggested by Traub et al. (1969). Each
of the scoring procedures was presented to the examinee in a
table showing the number of points gained or lost for a right, wrong,
or omitted item. Following the table, a further explanation of the
particular scoring procedure was given (e.g., “That is, for each
correct answer you get one point, for each wrong answer you lose 1/4 of
a point, and an omitted (unanswered) item does not count for or
against you”). The instructions for each test were given on a
separate page immediately preceding the test items.
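
The three weighting schemes can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the dictionary keys, the function name `score`, and the example examinee (9 right, 4 wrong, 2 omitted) are hypothetical and do not come from the study.

```python
from fractions import Fraction

# Weights for (rights, wrongs, omits) under the three scoring procedures.
# The key names are illustrative labels, not the authors' terminology.
SCORING_WEIGHTS = {
    "rights_only": (Fraction(1), Fraction(0), Fraction(0)),
    "penalty_for_wrongs": (Fraction(1), Fraction(-1, 4), Fraction(0)),
    "reward_for_omits": (Fraction(1), Fraction(0), Fraction(1, 4)),
}


def score(rights, wrongs, omits, procedure):
    """Return the weighted test score under the named scoring procedure."""
    w_right, w_wrong, w_omit = SCORING_WEIGHTS[procedure]
    return w_right * rights + w_wrong * wrongs + w_omit * omits


# Hypothetical examinee on one 15-item section: 9 right, 4 wrong, 2 omitted.
for name in SCORING_WEIGHTS:
    print(name, float(score(9, 4, 2, name)))
```

Exact fractions are used so that quarter-point weights do not pick up floating-point error; the same response pattern yields a different total under each procedure.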

Test booklets. In order for all examinees to respond to items under
each of the three instructions, a 45-item “Verbal Ability Test”
booklet was constructed. Each section of the booklet consisted of one of
the three instructions paired with one of the three 15-item tests.
A 3 × 3 Latin Square was used to construct the booklets (rows =
groups, columns = instructions, and cells = 15-item tests).
Although not analyzed, the three instruction-test combinations for each
group were given in all six possible orders to an equal number of
examinees.
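
A booklet design of this kind can be sketched as follows. The report does not say which Latin square was used, so the cyclic square below is an assumption, as are the test labels "Test A" to "Test C" and the helper name `latin_square`.

```python
from itertools import permutations


def latin_square(n):
    """Cyclic n x n Latin square: the cell at (row, col) holds (row + col) % n."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


GROUPS = ["Group 1", "Group 2", "Group 3"]
INSTRUCTIONS = ["rights only", "-1/4 for wrongs", "+1/4 for omits"]
TESTS = ["Test A", "Test B", "Test C"]  # hypothetical labels for the three forms

square = latin_square(3)

# Rows = groups, columns = instructions, cells = which 15-item test is paired.
for group, row in zip(GROUPS, square):
    pairing = {INSTRUCTIONS[col]: TESTS[test] for col, test in enumerate(row)}
    print(group, pairing)

# Each group's three instruction-test sections can be presented in 3! = 6 orders;
# with 24 examinees per group, that is 4 examinees per order.
orders = list(permutations(range(3)))
print(len(orders), "orders,", 24 // len(orders), "examinees per order")
```

Every test then appears once under each instruction and once in each group, which is the balancing property the Latin square is meant to provide.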

The last page in each booklet was an evaluation sheet. Examinees
first ranked (1 = most liked, 3 = least liked), then rated on a 9-point
scale (9 = liked very much, 1 = disliked very much), how well they
liked each section of the test booklet.

The 72 test booklets were shuffled before being distributed to the 
students. No time limit was imposed and all examinees finished 
within the class period. 


Results and discussion. Liking rankings and ratings. The mean
ranks for the rights only, 1/4 of a point for omits, and −1/4 of a
point for wrongs instructions were 1.72, 1.82, and 2.46 respectively.
A Friedman χ² computed on the ranked data across all 72
examinees was 23.74 (p < .001, df = 2). Thus, in terms of relative
preference, the −1/4 for wrongs instructions were consistently ranked
lower than either of the other instructions.
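
As a rough check on the Friedman statistic, the standard formula can be applied to the three reported mean ranks. This sketch assumes the textbook form of the statistic with no correction for ties; the function name is illustrative. Because it starts from mean ranks rounded to two decimals, it gives a value close to, but not identical with, the reported 23.74.

```python
def friedman_chi_square(mean_ranks, n):
    """Friedman chi-square from mean ranks (no tie correction).

    With k conditions and n examinees:
        chi2 = 12 n / (k (k + 1)) * sum(mean_rank_j ** 2) - 3 n (k + 1)
    """
    k = len(mean_ranks)
    return 12 * n / (k * (k + 1)) * sum(r * r for r in mean_ranks) - 3 * n * (k + 1)


# Mean ranks reported for rights only, 1/4 for omits, and -1/4 for wrongs.
chi2 = friedman_chi_square([1.72, 1.82, 2.46], n=72)
print(round(chi2, 2))  # about 23.2, versus the reported 23.74
```

The small gap is what one would expect from rounding in the published mean ranks, although the report does not state how the statistic was computed.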

A summary of the analysis of variance of the ratings is given in
the left-hand columns of Table 1. The only significant F was for
Instructions.

Using the Newman-Keuls procedure, comparisons among the
instructions’ means indicated that the rights only (4.89) and 1/4
for omits (4.67) instructions were rated significantly higher than
the −1/4 for wrongs instructions (3.36). These data support the use
of a procedure of allowing some fractional number of points for
omitted items, rather than the conventional procedure for
discouraging guessing on multiple-choice tests.

Number of omitted items. An analysis of variance of the
number of omitted items (summarized in the right-hand columns of
Table 1) indicated that both Groups and Instructions were significant.
No explanation can be offered for the difference among the groups
(Group 3 omitted significantly more items than Groups 1 and 2).
A comparison of the means for the three instructions using the
Newman-Keuls procedure showed that the 1/4 for omits (6.29) and −1/4
for wrongs (6.42) instructions resulted in significantly more omitted
items than the rights only instructions (3.00).

Because the mean omits for the rights only instructions
seemed somewhat high, a comparison was made of the
omits under these instructions when the instructions were for the
first, second, and third sections of the booklets. When the rights only
instructions were for the second or third sections, almost [...] as
many items were omitted as when the instructions were for the first
section of the booklet. It is hypothesized that an omissive set
from either or both of the instructions designed to elicit omits
carried over to the rights only section when it followed either or both of
the other instructions. This effect was not found for the other two
instructions.


TABLE 1
Summary of Analysis of Variance for Ratings of Liking and for Number of
Omitted Items

                                   Ratings           Omits
Source                   df      MS      F         MS      F
Between Examinees        71
  Groups (G)              2    10.17   1.27        …      6.19
  Error (b)              69     7.98               …
Within Examinees        144
  Instructions (I)        2      …       …         …     20.29*
  Tests (T)               2     3.94   1.08       .80     <1
  I × T                   2     1.72    <1         …       …
  Error (w)             138      …                 …

* p < .01.
[… = entry illegible in the source.]

Correlations with test and course performance criteria. To
compute the correlations involving scores obtained under each of the
instructions, the three matched 15-item tests were considered as
equivalent forms. Three rights scores (one for each
instruction) and two formula scores (R + 1/4 O and R − 1/4 W) were
computed for each examinee. Since the formula scores correlated
.95 with the rights scores obtained under the corresponding instructions,
only data for the rights scores are reported. The scores obtained
under “rights only,” “1/4 of a point for omits,” and “−1/4 of a point
for wrongs” instructions correlated .63, .50, and .59 with SAT-V
(n = 40) and .42, .30, and .29, respectively, with course grade
(n = 70). None of the correlations for the three instructions differed
significantly from each other (p > .05).

In general, more items were omitted under both the bonus for
omits and penalty for wrongs instructions than under the rights
only instructions, and no significant differences were found in the
correlations of scores obtained under the instructions with test and
course performance criteria. However, the sections of the test taken
under the bonus for omits instructions were liked better than
those taken under the penalty for wrongs instructions. If guessing
behavior is to be discouraged on a multiple-choice test, the results
of Traub et al. (1969) and the present study seem to indicate that
a procedure for rewarding the examinee for omitted items would
be superior to penalizing the examinee for wrong answers.

ADVANCED PLACEMENT SCORES: THEIR
PREDICTIVE VALIDITY¹


PAUL S. BURNHAM AND BENJAMIN A. HEWITT
Yale University 


WHILE Advanced Placement scores have been used increasingly
since the inception of the program in 1952, little “hard data”
evidence of their validity has been published. Developed by the
College Entrance Examination Board, the program functions
under the assumption that able students can successfully complete
some college courses while they are in secondary school
(Blackmer, 1952; Casserly, 1966; Cornog, 1956; and Wilcox, 1962).
Participating schools offer to their better students courses which the
program has planned, outlined, and described. These students take
examinations which the program’s committee of readers grades on
a scale ranging from five, extremely well qualified, to one, no
recommendation. After participating colleges receive the scores they
may decide to place students in advanced courses or to award them
college credit.

To judge the value of this program we decided to relate the AP
scores of Yale students to criteria of ability and achievement, such
as scores on other tests and grades in pertinent [...] programs.
We were disappointed to find that extensive use of AP
scores in course placement had resulted in the severe
fractionation of many groups which had appeared initially to be
promising. For example, the class of 1963, with which we started
to work, comprised 1031 matriculants; they presented
583 AP scores. Of these, only the 149 in English and the [...] in
mathematics could be analyzed profitably. Preliminary work
revealed that regardless of the score they achieved, the students
taking the AP examination in English and/or in mathematics were
initially more able and produced better as Yale freshmen than the
average matriculant.

¹ This study was partially supported by a research grant from the
College Entrance Examination Board.

To obtain more evidence we accumulated the same kinds of 
information for the class of 1967, where we had a study group of 
1027 students; 500 had presented 930 AP scores in the seven ex- 
amination fields appearing sizeable enough to study. 

Ability of scores to differentiate performance level—In courses 
in the field of the AP examination the average grades of AP stu- 
dents rose as their AP scores increased. Using class of 1967 data, 
we accumulated all of the freshman and sophomore course grades 
of the AP students in each field except physics; the latter num- 
bered too few students to warrant this kind of analysis. Next 
these grades were averaged after they had been grouped according 
to the field of the man’s examination and his score. 

Rankings by magnitude of the AP scores and of the average 
college grades in that field corresponded highly with only minor 
exceptions. These findings were particularly impressive because the 
data included grades of many high-scoring students who took 
advanced courses initially and even higher-level courses later; in- 
cluded too, were grades of students who took only the first course 
offered by a department. Thus, these mean grades reflected the
performance of students in courses at various levels of
advancement and complexity and with disparate content and grading
standards. 

Multiple correlation evidence of the predictive power of scores.
Of all the courses in Yale College, only the six shown in Table
1 had enough students with AP scores of one through five to
justify a multiple correlation analysis involving the student's
term-course grade as the dependent variable and his SAT-V score, his
predicted freshman-year average, and his scores on relevant CEEB
achievement and aptitude tests as the independent or predictor
variables. The latter are indicated in Table 1. By adding the AP
score to the other independent variables, we found a slight
increase in each of the six multiple correlations. While the samples
are small, the consistency with which the AP scores made a useful
contribution to the other matriculation data offers further evidence
of their predictive validity.


TABLE 1
Multiple Correlation Data, Class of 1967

                           Predictor Variables in       Multiple Correlation with
                           Addition to SAT-V and        Grades in Yale Courses
                           Matriculation                Excluding     Including
Course(a)                  Prediction(b)                AP Scores     AP Scores

English 24 (n = 35)        CEEB English, AP English        .18           .22
English 25 (n = 87)        CEEB English, AP English         …            .38
French 41 (n = 17)         CEEB French, AP French          .46           .49
Mathematics 10 (n = 21)    CEEB SAT-M,                      …            .73
Mathematics 15 (n = 71)    Adv. Math.,                     .41           .50
Mathematics 20 (n = 30)    AP Math                         .39           .42

Median Correlation                                         .40           .45

(a) Courses in Chemistry, History and Physics were not analyzed
because of the paucity of data.
(b) A prediction of each student's probable average grade in all his
Freshman year courses.
[… = entry illegible in the source.]
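
The comparison behind Table 1 can be illustrated with a small sketch that fits two least-squares models, one without and one with an added predictor, and compares their multiple correlations. Everything here is an assumption for illustration: the data are simulated, the variable names only echo the predictors named in the text, and the helper functions are hypothetical rather than the authors' procedure.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def multiple_r(predictors, y):
    """Multiple correlation of y with a list of predictor columns (with intercept)."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return max(0.0, 1.0 - ss_resid / ss_total) ** 0.5


# Simulated standardized scores for a class of 87 (not the Yale data).
rng = random.Random(0)
n = 87
sat_v = [rng.gauss(0, 1) for _ in range(n)]
ceeb = [0.6 * v + rng.gauss(0, 1) for v in sat_v]
ap = [0.5 * c + rng.gauss(0, 1) for c in ceeb]
grade = [0.3 * v + 0.3 * c + 0.3 * a + rng.gauss(0, 1)
         for v, c, a in zip(sat_v, ceeb, ap)]

r_without = multiple_r([sat_v, ceeb], grade)
r_with = multiple_r([sat_v, ceeb, ap], grade)
print(f"R excluding AP: {r_without:.2f}; including AP: {r_with:.2f}")
```

In any one sample, adding a predictor cannot lower the multiple correlation, so a rise by itself is guaranteed; the point of interest in the text is that the increase appeared consistently across six separate courses.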

Ability and achievement of AP and non-AP students. In each
subject except French Literature, students making AP scores of
four and five were superior to other AP students in ability as
measured by the mean of each student’s CEEB scores. Also,
in all seven AP subject-matter fields, the group with AP scores
proved to be superior in their CEEB records to students who
attended “AP schools” but did not take that particular AP test.

AP students attained higher average freshman and sophomore
grades within the field of the AP test than those who attended
“AP schools” but did not take the test. As one might expect, the
high-scoring AP men made the more impressive records in the
field of the examination; only in physics was this not true.
Table 2 provides supporting data.


TABLE 2
Summary of Matriculation and College Performance Records of Class of 1967
Students from AP Secondary Schools

[Table body not recoverable from the source.]

† Based on both the Aptitude and Achievement Tests taken.
* Mean grade in all college courses taken in the field of the [...]


Matched-group comparisons. The extent to which college-level
work can be completed effectively in secondary school might be
inferred by comparing the grades of AP freshmen assigned to an
advanced course with those of non-AP sophomores having
comparable ability in terms of CEEB averages, enrolled in the same
course, but entering it through a lower-level two-term college
sequence. (We matched students according to the mean of four or
more of their CEEB scores; the SAT-V and -M comprised two of
these, and the balance was formed by as many of the achievement
test scores as each man had presented. Some of the means were
based on as many as eight scores, but none on fewer than four.
While we recognized that strict matching on the basis of the same
four or more CEEB tests would have been more desirable, doing
so would have fractionized our samples so much that this phase
of the analysis could not have been accomplished.) We
investigated advanced courses to which AP students were initially
assigned. Only in two English courses and one French course, each
of two terms, and in one single-term mathematics course were we
able to establish matched groups which could be compared by
taking the average of their two term grades, and their one grade
in the single-term course, as criteria of performance.

Data in Table 3 suggest that the freshmen who had gained
advanced placement in English 25 and French 41 nearly “held their
own” with the two groups of sophomores who had been prepared
for these courses through a year of college study. In Mathematics
30, grades of the AP students were clearly superior to both groups
of sophomores. This evidence, while limited to three
subject-matter fields and involving only small samples, adds weight to other
previously presented evidence that the AP Program can prepare
able secondary-school students to compete effectively in
upper-class college courses.

In English 15 we present information about students who made
lower AP scores and were not assigned to upper-class courses; as
a group they produced a bit better than the non-AP freshmen. One
might expect them to do so because in a sense English 15 was a
second exposure to the AP work they had done in secondary school
but presumably was new material for the non-AP freshmen.

Discussion and conclusions. Numerous positive findings resulting
from the study of small populations of Yale students offer
consistent evidence of the validity of the AP Program.

1. As a group, AP students appeared more able and achieved
better than the average matriculant. They were also superior to
non-AP students from “AP schools.”

2. In freshman and sophomore courses in the field of the AP
examination the average grades of students grouped according to
their AP score rose as the AP score increased and thus gave
evidence of the predictive differentiation by the AP scores.

3. Since predictions of performance in college courses were
improved slightly but consistently by adding AP scores to other
tests of aptitude and achievement, AP tests appear to be measuring
factors of some pertinence which are not otherwise included in
the matriculation records.

4. Comparisons of the performance of AP freshmen with that of
non-AP sophomores of comparable ability, in the same courses,
indicate that the two groups achieved rather similarly. The AP Program
appears to prepare able students to compete adequately with more
advanced college students.


TABLE 3
Achievement in Four College Courses of AP and Non-AP Students of Comparable Ability

                                                                        Average of Four or   Final Grade
                                                          Range of      More CEEB Scores     in Course
                                            Class    N    AP Scores     Mean       SD        Mean

English 15, Problems in Writing
 (a) Freshmen with English AP scores          …      …       …           …         …          …
 (b) Freshmen without English AP scores     1967    65                  642        40        79.2

English 25, Major English Poets, Chaucer to Eliot
 (a) Freshmen with English AP scores,
     taking this as first college
     English course                         1967    16      2-5         618        31        80.8
 (b) Sophomores without English AP
     scores, taking this after
     completing English 15 as Freshmen      1967    16                  619        30        81.1
                                            1966    16                  615        29        81.3

French 41, Introduction to French Literature: Seventeenth to Twentieth Centuries
 (a) Freshmen with French AP scores,
     taking this as first college           1967    15   Lang. 2-5,     665        31        80.6
     French course                                       Lit. 1-4
 (b) Sophomores without French AP
     scores, taking this after
     lower-level Freshman courses           1967    15                  660        31         …
                                            1966    15                  660        34         …

Mathematics [...], [...] Calculus
 (a) Freshmen with [...] AP scores,
     [...] Yale mathematics course          1967    21      3-5         702        33        81.8
 (b) Sophomores without AP scores,
     enrolled after completing
     Mathematics 10 and 15 as Freshmen      1967    21                  701        33         …
                                            1966    21                  696        36         …

The broad range of grades in Mathematics (60-9[...]; 50-100; and 60-95
respectively), combined with the flatness of the distributions, accounts
for the seemingly atypical standard deviations.
[… = entry illegible in the source.]


[...] higher education, we would urge colleges to test its validity by
conducting controlled experiments under which students are assigned
randomly to courses regardless of their AP scores and their par- [...]

[...] calculation of measures of “statistical significance” which are not
appropriate with highly selected samples [...] study.


Blackmer, A. R. (Chm.) General education in school and college.
    A committee report by members of the faculties of Andover,
    Exeter, Lawrenceville, Harvard, Princeton, and Yale.
    Cambridge: Harvard University Press, 1952.
Casserly, P. L. College decisions on advanced placement. College
    Entrance Examination Board Research and Development
    Reports, RDR 64-5, No. 15. Princeton: Educational Testing
    Service, Jan. 1966.
Cornog, W. H. College admission with advanced standing. Final
    report and summary of the June 1955 evaluating conferences
    of the school and college study. Gambier: The School and
    College Study of Admission with Advanced Standing, Kenyon
    College, 1956.
Wilcox, E. T. Seven years of advanced placement. College Board
    Review, Fall 1962, 48, 29-34.

SAT AND HIGH SCHOOL AVERAGE PREDICTIONS OF
FOUR YEAR COLLEGE ACHIEVEMENT 


MARVIN SIEGELMAN 
City College of New York 


SURPRISINGLY little research has been conducted on the most
widely used college selection test, the Scholastic Aptitude Test
(SAT) of the College Entrance Examination Board (Zimmerman,
1965). Although the SAT has been in existence for over 20 years and
is currently used in over 500 colleges and universities, almost all of
the validity data have been reported by the Educational Testing
Service (Whitla, 1965; Zimmerman, 1965). Most reports also note
only the relation between SAT and Freshman Grade Point Average
(GPA), ignoring the remaining three years (Fricke, 1965). The
contribution that SAT makes to High School Average (HSA) in
predicting GPA is another infrequently examined area. The purpose of
the present research was to analyze the degree of association
between achievement (GPA) during four years of college attendance
and (a) SAT Verbal, Mathematics, and Total Scores, (b) HSA,
and (c) a composite of SAT and HSA scores.

Method. Subjects. The subjects in the present study consisted
of students who had completed four years of study, earned
128 credits, and graduated from the City College of New York
(CCNY). SAT, HSA, and GPA data were collected from the
permanent records in the CCNY Registrar's office for 80 males and 216
females. Although all students were interested in teaching, they
represented all areas of undergraduate concentration in [...]
arts and fine arts. The interest in teaching was indicated by
completing a sequence of education courses that are required
for teacher licensing in New York City. Most students came from
an upper-lower or middle-class socio-economic background. The
average age at entering CCNY for males was 17.98 (SD =
2.06), and for females it was [...] (SD = 1.87).

The admission criteria at CCNY consisted of an Entrance
Composite Score (ECS) which included the HSA plus a converted SAT
Verbal (SAT-V) and SAT Math (SAT-M) score. In terms of a
priori weights, the HSA was intended to contribute 50 per cent to
the ECS and the SAT-V plus SAT-M the remaining 50 per cent.
The males had the following means (M) and standard deviations
(SD) in their scores when they entered CCNY: HSA, M = 83.86,
SD = 5.50; SAT-V, M = 502.94, SD = 83.39; SAT-M, M = 529.70,
SD = 75.69; SAT V + M, M = 1032.64, SD = 186.67; ECS, M =
167.76, SD = 8.81. When the females entered CCNY their
corresponding statistics were: HSA, M = 86.46, SD =
4.34; SAT-V, M = 503.76, SD = 89.57; SAT-M, M = 504.89, SD
= 89.74; SAT V + M, M = 1008.65, SD = 153.63; ECS, M =
171.28, SD = 23.42.

Results and discussion. The most striking aspect of Table 1 is the
exceptionally low correlations between SAT and grades for males
in contrast to females. The strength of relationship for the male SAT
findings, but not for the female results, is also generally lower
than that for the typical SAT validity coefficients for male or
female college freshman GPA, which usually range from .16 to .61
(Zimmerman, 1965).


TABLE 1
Correlations between SAT, HSA, and College Grades for Males and Females*

[Column headings and part of the table body are not recoverable from the
source; the legible rows are reproduced as printed. … = illegible entry.]

Females (N = 216)

Upper freshman       −01    −05     25      …
[...] sophomore       20    −09     25     20
Upper senior          05      …      …     22

[...] GPA
Lower freshman        04     00      …      …
Upper freshman        00    −07      …      …
Lower sophomore       06    −11     27     09
Upper sophomore       10    −08     29     23
Lower junior          11    −08     36     29
Upper junior          24     04      …     35
Lower senior          08    −02     43     37
Upper senior          10     03     44     40
English average       16     06     23     29     29     15
Science average       07    −01     15      …     29     36


Greater homogeneity in ability of CCNY students, especially 
of those who complete four years of college, may account in part 
for lowered validity coefficients, but the close to zero correlations 
for males probably can not be attributed entirely to this reduced 
variability. One could speculate, for example, that the males were 
less conforming than the females and tended to perform more 
according to their interests and motivations than according 
to their academic potential as estimated by the SAT. The cor- 
relations between HSA and GPA for males, furthermore, do not 
indicate especially uniform ability. The efficiency of the SAT for 
predicting GPA for males, at least at CCNY, must be seriously 
questioned. 
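
The argument that reduced variability alone cannot explain near-zero coefficients can be made concrete with the standard formula for direct restriction of range on the predictor. This is a sketch under stated assumptions, not an analysis from the study: it assumes linearity, homoscedasticity, and selection directly on the SAT, and both the unrestricted correlation of .40 and the reference-group SD of 100 are hypothetical values chosen for illustration (only the SD of 83.39 is taken from the text).

```python
from math import sqrt


def restricted_r(r_unrestricted, sd_ratio):
    """Correlation expected after direct selection on the predictor.

    sd_ratio is the restricted SD divided by the unrestricted SD.
    """
    r, u = r_unrestricted, sd_ratio
    return r * u / sqrt(1 - r * r + r * r * u * u)


def corrected_r(r_restricted, sd_ratio):
    """Inverse of restricted_r: estimate the unrestricted correlation."""
    r, k = r_restricted, 1 / sd_ratio
    return r * k / sqrt(1 - r * r + r * r * k * k)


# SAT-V SD for CCNY males (83.39) against an assumed reference SD of 100.
sd_ratio = 83.39 / 100
attenuated = restricted_r(0.40, sd_ratio)  # 0.40 is a hypothetical starting value
print(round(attenuated, 3))
print(round(corrected_r(attenuated, sd_ratio), 3))  # recovers the starting value
```

Under these assumptions a modest reduction in SD lowers a coefficient of .40 only slightly, which is in line with the authors' view that restricted variability accounts for part, but not all, of the low male correlations.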

The relationship of HSA to GPA, on the other hand, holds up
reasonably well for both males and females. It is comparable to
that found in the data reported by Passons (1967) and
somewhat lower than that indicated by the Lins, Abell, and Hutchins
(1966) findings. For males, the use of the SAT-V plus SAT-M with
HSA actually lowered the validity coefficients in comparison with
the predictions made from the HSA alone, whereas for the females
there was an increase in ECS over HSA used alone. The median
ECS correlation increase for females was .12.

The predictions of cumulative GPA indicate that the accuracy
of the HSA and SAT predictors improves for females as the
number of completed credits represented in GPA increases. The
expanded GPA reliability might be expected to account in part for the
higher correlations.

The findings of the present study, especially for males, call
attention to the need for additional research on the use of
the SAT as a criterion for admission to college. The time,
money, and effort involved in securing SAT scores should be
considered in terms of how much the SAT contributes to
predicting GPA.

THE RELATIONSHIP BETWEEN EXPECTED GRADES
AND STUDENTS’ EVALUATIONS OF 
THEIR INSTRUCTORS 


DAVID S. HOLMES*
University of Kansas, Lawrence 


THERE has been some concern about the validity of student
ratings of their instructors’ teaching performances because the
students may not be completely objective observers. It has been
suggested that the grades the students expect may influence their
evaluations of the instructors. Unfortunately, the research on this
question (e.g., Anikeef, 1953; Heilman and Armentrout, 1936;
Hudelson, 1951; Riley, Ryan, and Lifshitz, 1950; Voeks and
French, 1960; Weaver, 1960) has been relatively meager,
productive of conflicting conclusions, uninformative because different
elements of the evaluation were not considered separately, and in
many cases methodologically inadequate. In view
of the growing use and importance of student evaluations (Eble,
1970), the present study was carried out to assess the relationship
which expected grades have to evaluations.

Method. The evaluation instrument used was the Teaching
Assessment Blank (TAB) (Holmes, 1971). The students who
provided the data for the present study were [...]
classes surveyed in the College of Arts and Sciences at the
University of Texas (Austin). All classes used [...]
was presented earlier (Holmes, 1971). In this analysis, however,
data from subjects who expected grades of D or F were not
considered because combined they constituted fewer than seven per
cent of the students. The numbers of students expecting grades of
A, B, and C were 200, 752, and 587, respectively. The analyses
were limited to the 18 items which make up the Instructor
Presentation, Evaluation-Interaction, and Student Stimulation subscales
of the TAB.

Results and discussion. The responses of students expecting A's,
B's, and C's were compared on each of the 18 evaluation items
in each of the seven classes. The probability values associated
with the resulting F values are presented in Table 1. If the
probability value associated with an item was .10 or less the item
was considered to be related to expected grades in that class and
is italicized in Table 1. This leniency in terms of probability
level was compensated for by the demand for replication across
classes which will be considered later. In 57 (45 per cent) cases the
probability values indicated that the item was related to expected
grades. Since fewer than 12 of these 57 values would be expected
by chance, it is clear that expected grades were related to
evaluation responses. More important, however, were the answers to
questions of (a) whether the relationships were consistent across
classes, (b) whether the associations were limited to any specific
subset of items, and (c) whether the correlations were of practical
as well as statistical significance.


TABLE 1
Probability Values Associated with F's Arrived at in Comparing Responses
of Students’ Expected Grades of A, B, and C

                              Class
Item       A      B      C      D      E      F      G

Student Stimulation Subscale
 *11.     .08    .00    .18    .60    .00    .92     …
**12.     .00    .09    .00    .16    .00    .23    .53
  15.     .02    .00    .18    .71    .04    .87    .57
  24.     .00    .00     …      …     .00    .86     …
  25.     .08    .08    .00    .50    .00     …      …
 *29.     .00    .01    .00    .00    .01    .01     …
  30.     .00    .14    .01    .00    .00    .15     …

Interaction-Evaluation Subscale
 *13.     .00    .62    .02    .32    .01    .23     …
  14.     .01    .80     …     .54    .01    .96     …
  19.      …     .01    .06    .08    .65    .89     …
  20.     .00    .20     …      …     .00    .28     …
  21.     .00    .18    .04    .00    .00     …      …
  23.     .08    .07    .59    .88    .25    .81     …

Instructor's Presentation Subscale
   7.     [.76, .23, and .81 legible; columns uncertain]
   8.     .02     …     .34    .00     …     .24     …
   9.     .10    .08     …      …      …      …      …
          [.96, .51, .00, and .53 legible; row and columns uncertain]
 *10.     .06    .70    .00    .61    .08    .52     …
  16.      …     .01    .22    .61    .91     …     .29

* Highly consistent.
** Moderately consistent.
[… = entry illegible in the source; consistency markers and italics are
legible for only some items.]


Consistency across groups. From inspection of the values
presented in Table 1 it is clear that while expected grades were related
to the responses given to a number of items in classes A through
E, the expected grades were for the most part not related to the
responses of the students in classes F and G. In fact, classes A
through E averaged 11.2 related items, whereas there was only
one related item in each of classes F and G. It is, therefore,
apparent that there were clear and important intergroup differences
in terms of whether or not expected grades were related to
evaluation responses. The question of the “boundary conditions” which
determined whether expected grades influenced evaluations will
be discussed later. Because it is clear that classes F and G were
beyond the limits of prediction, in future analyses only the data
from classes A through E will be considered.

In the present study seven items were found to be related to
expected grades in four of the five classes being considered. These
items will be referred to as highly consistent in their relationships
to expected grades. Another group of five items was found to be
related in three of the five classes, and these items were classed as
moderately consistent in their relationship to expected grades.
These two groups of items accounted for 82 per cent of the
italicized probabilities for classes A-E noted in Table 1. Because of
the probability level used, most of the remaining italicized
probabilities could be attributed to chance. It, therefore, can be
concluded that there were a number of items that were consistently
related to expected grades.

Content of consistently related items. All items making up the
Student Stimulation subscale were consistently (five highly, two
moderately) related to expected grades. Students who expected
higher grades reported that they paid more attention (No. 11),
felt more challenged (No. 12), were more stimulated (No. 15), were
more interested (No. 24), looked forward to attending class more
(No. 25), made more of an effort to learn (No. 29), and learned
more (No. 30) than did students who expected lower grades. The
responses to four of the items from the Interaction-Evaluation
subscale showed a consistent (two high, two moderate)
relationship to expected grades. The analyses indicated that students who
expected higher grades reported the instructor had more adequate
evidence on which to base their grades (No. 20), the grading
system was fairer (No. 21), they felt freer to ask questions and
disagree (No. 13), and they thought that the instructor returned
assignments more promptly (No. 19) than did students who
expected lower grades. Interestingly enough, the students' evaluations
of the instructors' fairness in dealing with students apart from
grading (No. 14) and the instructors’ interest in students (No. 23) were
not found to be related to expected grades. In sharp contrast to the
items measuring the degree to which students were stimulated and
their evaluation of the grading process are the items on the
Instructor Presentation subscale, for none of these was highly
consistent in its relationship to expected grades while only the
item measuring the degree to which the student felt the instructor
was aware of whether the class was following him (No. 10) was
moderately consistent in its relationship to expected grades.

In summary, it appears that students who had expected lower
grades reported less personal involvement in the course than did
students expecting higher grades but, contrary to what might be
expected, students anticipating lower grades were not more
critical of the instructors’ presentations than were students
expecting higher grades. In other words, it seems that when a
student expected a low grade he did not become critical of the
instructor per se but rather he seemed to accept the blame himself
and in effect said, “I didn’t do better because I didn't become
[...] enough.” The only externalization of blame noted in these results
involved the criticism of the evaluation system, but because the
classes which were used for this analysis were all large ones ([...]
students) in which multiple choice examinations played a
predominant role in determining grades, the evaluation system was
an easy and very possibly a justifiable target for criticism by
students expecting low grades.

It cannot necessarily be assumed from these data that the low- 
ered scores on the Student Stimulation subscale items of students 
who had expected low grades resulted from a defensive distortion 
which they used to explain their relatively poor grades. It may be 
that students who had anticipated higher grades were more intel- 
ligent than those who had expected lower grades, and because of 
superior ability they were able to benefit more from and be stim- 
ulated more by the instructor. With regard to this possibility, it 
should be noted that in every class there was a significant differ- 
ence in the actual grade point averages of students expecting A's, 
B's and C's. If ability influenced both expected grades and the 
degree to which a student was stimulated, the relationship between
expected grades and evaluation responses would not be a
threat to the validity of the responses but would instead indicate 
that students of different levels of ability were differentially influ- 
enced by the instructor. Whether this circumstance is the case 
generally will have to await further research. 

Practical importance of relationships noted. While the above
results are of statistical significance, it must be asked whether these
relationships might have any practical importance in terms of the
potential influence they exert on evaluation items. To answer this
concern, within each class, correlations were computed between expected
grades and the responses to the seven items which were found
to be highly consistent in their relationships to expected grades.
Squaring these values indicated the amount of shared variance.
The mean amount of variance shared by expected grades and responses
to consistently related evaluation items was only [...]
per cent. In only four of the 35 cases was more than [...] per cent
of the variance shared, and in no case was more than 13 per cent
shared. From these results it is clear that while there was a number
of items which were consistently related to expected grades, the
potential influence of this relationship was negligible.
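
The squaring step described above can be sketched in a few lines. This is only an illustration: the function name and the sample correlations are hypothetical and are not values from the study. It shows that a correlation of about .36 corresponds to roughly 13 per cent of shared variance, the largest share reported.

```python
def shared_variance_percent(r: float) -> float:
    """Percentage of variance two variables share, given their correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 100.0 * r * r


# Hypothetical within-class correlations, not values taken from the study.
for r in (0.10, 0.20, 0.36):
    print(f"r = {r:.2f} -> {shared_variance_percent(r):.1f} per cent shared")
```

Because the share grows with the square of the correlation, even a statistically significant correlation of .20 leaves only about 4 per cent of the variance in common.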

Boundary conditions. A number of variables (e.g., overall evaluation
and expected grades) were used to compare the classes in
which the evaluation responses were and were not related to expected
grades. Since none of these comparisons yielded interpretable
results, it is not possible at the present time to specify the
characteristics of classes in which the relationships will and will
not be found. However, the above suggestion that the relationships
between evaluation responses and expected grades might be
mediated by the students' ability offers one possibility. It may be that
in those classes in which no relationships were found, the instructors
had directed their lecture material at the C level students
and, therefore, given this low level of approach, the instructors'
presentations might not have differentially stimulated students of
different ability levels.
Conclusions. In most classes relationships existed between the
grades students expected and the degree to which they reported
they were stimulated by the instructor and the degree to
which they felt the grading system was fair. On the other
hand, items assessing the instructors' presentations were not found
to be related to the expected grades. Thus it did not appear that
expected grades were related to a general halo effect. Unfortunately,
these data did not indicate whether the lower levels of
stimulation reported by students anticipating lower grades were a
defensive reaction or whether students expecting higher grades
were more intelligent and, therefore, were in a better position to be
stimulated by the instructors' presentations. While the nature and
antecedents of the observed relationships were of considerable
theoretical interest in terms of both the evaluation process and the
possible differential effect of instructors on different types of students,
it is clear that the relationships did not severely distort the
evaluation process and therefore did not pose a serious threat to the
validity of the evaluation system.

FACTOR ANALYSIS OF 1970-71 VERSION OF THE
COMPARATIVE GUIDANCE AND PLACEMENT 
BATTERY 


JOSEPH GRIMALDI 
Marymount College, Tarrytown, New York 


EUGENE LOVELESS, JAMES HENNESSY, AND JOHN PRIOR
Queensborough Community College 
The City University of New York 


THE Comparative Guidance and Placement Program (CGP) is
a multi-purpose battery published by the College Entrance Examination
Board (CEEB). The battery, which includes biographical,
interest, ability and achievement measures, was designed primarily
for use at community and two-year colleges. Lunneborg,
Greenmun, and Lunneborg (1970) reported a factor analysis of
the 1967 version of CGP. However, the 1970-71 version of the
CGP differs in composition from the 1967 version. Consequently, [...]


Purpose and procedure. The 1970-71 version of the CGP battery
served as the basis of the present factor analytic investigation.
The battery was administered to the freshman class of
Queensborough Community College entering in the Fall of 1970
(N = 1637). The battery was scored by Educational Testing
Service (ETS). The 11 interest scales and the eight [...]
scales were then factor analyzed utilizing a principal
components solution followed by rotation to the varimax criterion.
Unities were placed in the diagonal cells of the intercorrelation
matrix. All principal components that had
eigenvalues equal to or greater than one were retained for rotation.

Results. Table 1 contains the test intercorrelation matrix. Decimal
points have been omitted.

Table 2 furnishes the rotated factor solution. Communalities
as well as eigenvalues have been reported. Decimal points have
been omitted except for the eigenvalues. The six-factor solution
accounted for 70 per cent of the total variance. Factor I was primarily
defined by the cognitive scales with the exception of Mosaics.
Factors II, III, and IV were primarily (although not
completely) defined by interest scales. Biology, Health, and Physical
Sciences defined Factor II; Secretarial, Business, Home Economics,
and Academic Motivation were loaded on Factor III;
and Mathematics, Physical Sciences, Engineering Technology and
Social Sciences were weighted on Factor IV. Factor V was also
described by cognitive scales, specifically Mosaics, Letter Groups,
Mathematics, and the Year 2000. The remaining interest scales
defined Factor VI.
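
The procedure described above (unities in the diagonal, retention of components with eigenvalues of one or more, varimax rotation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the program the authors used: the function names are hypothetical, the rotation uses a common SVD-based varimax iteration, and the small correlation matrix is invented for the example.

```python
import numpy as np


def principal_components(corr):
    """Principal components of a correlation matrix with unities on the diagonal.

    Returns all eigenvalues (largest first) and the loadings of the
    components whose eigenvalue is equal to or greater than one.
    """
    corr = np.asarray(corr, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(corr)      # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]            # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0                        # eigenvalue-one retention rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, loadings


def varimax(loadings, max_iter=200, tol=1e-10):
    """Orthogonal varimax rotation of a loading matrix (SVD-based iteration)."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                        # product of orthogonal matrices
        if s.sum() < criterion * (1.0 + tol):    # stop when the gain is negligible
            break
        criterion = s.sum()
    return loadings @ rotation


# Hypothetical 4-variable correlation matrix, not data from the study.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
])
eigvals, unrotated = principal_components(R)
rotated = varimax(unrotated)
explained = eigvals[eigvals >= 1.0].sum() / R.shape[0]   # share of total variance
communalities = (rotated ** 2).sum(axis=1)               # unchanged by rotation
print(np.round(rotated, 2), round(float(explained), 2))
```

With unities in the diagonal the total variance equals the number of variables, so the proportion accounted for is the sum of the retained eigenvalues divided by that number. The rotation redistributes variance among the factors but leaves each communality unchanged.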

Discussion. Factor I has been interpreted as a scholastic aptitude
factor. It apparently encompassed the ability to cope with
school tasks (i.e., reading, vocabulary, sentences, mathematics)
as well as the ability to follow verbal directions and solve a problem
(Year 2000). The Letter Groups test also was loaded on this
factor but at a slightly lower level (.45). The latter finding is
consistent, since the Letter Groups test is a more nearly pure
measure of general reasoning ability than are the other tests that
were loaded on this factor. Hence the Letter Groups test was less
susceptible to scholastic influences than were other measures.

Factor II was primarily defined by the Biology and Health interest
scales. It seemed to represent both theoretical and applied
aspects of the Biology-Health domain. Factor III was primarily
defined by the Secretarial and Business interest scales and by the
Academic Motivation scale. This pattern of loadings suggested an
interpretation of this factor, viz., Practical Interest. The factor
bore a similarity to the Practical Outlook scale of the Omnibus
Personality Inventory (Heist and Yonge, 1968). Factor IV was
primarily defined by the Engineering Technology interest scale.
An interpretation of this factor as Interest in Technological Science
would appear consistent with the presence of the other variables
that also were weighted on this factor, viz., Physical Sciences
and Mathematics interest scales.

TABLE 2


Primary Factor Loadings of CGP Variables for 1970 QCC Freshman
Class Using the Varimax Solution


-01 06 з n 

Scale          I    II   III    IV     V    VI

Phy. Sc.      05    42   -08    79    13    12
Eng.         -06    00    -0    78   -00     ?
Bio.          09     ?   -03    26    06     ?
Health       -02    89    08    06    01    07
Home Ec.      01    43    56   -27   -17    76
             -06    -0     9   -04    -1    05

Bus.          01   -06    84    28    19    05
Soc. Sc.       1    -0     2     ?    38    37
Fine Arts     03    19    13    05   -16     8
Music         05    09   -01    13     0    78

Cognitive Scales

Read.          0     ?     ?     ?    02     3
Verbal         9     0    -6     ?     0     0
Sentences     74    07    05   -21   -27    08
Math          63   -03    -0     2   -43   -15
Yr. 2000      68   -05   -05    02   -40   -02
07 

00 


Factor V was primarily defined by tasks of a nonverbal reasoning nature (e.g., Mathematics,
Year 2000, and Letter Groups) as well as by a measure
of perceptual efficiency (Mosaics). Tentatively, this factor might
be defined as a Perceptual-Reasoning factor. However, the interrelationship
of the tests that were loaded on Factor V as well as
their relationship to external criteria bears further investigation.

In summary, the CGP battery was found to yield six interpretable
factors. Two of these factors were related to the cognitive
tests and four were related to the interest measures. The pattern
of loadings split very neatly according to content (i.e., interest
vs. cognitive). It would, therefore, seem more profitable if future
factor analytic studies divided the battery into portions prior to
factoring. The only measure in the battery that could not unequivocally
be placed in one or the other category was the Academic
Motivation scale.

CORRELATES OF A PASS-FAIL DECISION FOR
ADMISSION TO CANDIDACY IN A DOCTORAL
PROGRAM IN EDUCATION


WILLIAM B. MICHAEL, ROBERT A. JONES, HUDHAIL
CALVIN M. PULLIAS, MICHEL JACKSON, AND VALERIE GOO


University of Southern California 


Prior to admission to candidacy in the doctoral program of the
School of Education at the University of Southern California
(USC) students have been required not only to complete the
Aptitude portion of the Graduate Record Examination (GRE)
but also to take the Comprehensive Examination (CE), a battery
of five objective examinations: one two-hour test in each of the
three fields of Administration, Psychology, and Social and Philosophical
Foundations, and two one-hour tests in Curriculum
(Elementary Education and Secondary Education or Elementary
Education and Higher Education or Secondary Education and
Higher Education). Within two to four weeks after a student completes
the CE, the Doctoral Committee consisting of approximately
ten full-time faculty members reviews cumulative folders;
evaluates examination data, prior grades, letters of recommendation,
and professional experience and objectives; interviews the
candidate; and renders a pass or fail decision regarding his readiness
to be admitted to the doctoral program.

Purpose. The purpose of the investigation was to explore the
correlates of the pass-fail decision for a sample of 844
students (694 men and 150 women) at USC who took the CE
from July 1968 to October 1970 (a period covering
fifteen administrations of the CE), and who subsequently appeared
before the Doctoral Committee. Such information was judged
to be of considerable importance in evaluating the validity of
examinations in the admission of students to candidacy.

Findings. Calculated through use of a computer program to
allow for missing data for a number of students, the zero order
coefficients of correlation in Table 1 point to the following major
findings: (1) Moderate intercorrelations among the five parts of
the CE varied from .32 to .57. (2) Low to moderate validity
coefficients ranging from .11 to .49 were registered for the part and
total scores of the GRE. With respect to both separate and total
CE scores, higher validities appeared for the Verbal (V) than for
the Quantitative (Q) scores, the one exception being for the CE
in Administration. Higher validities were found for the GRE
scores relative to the foundation fields of Psychology and of Sociology
and Philosophy than relative to applied fields of Curriculum
and Administration. (3) With respect to the pass-fail criterion,
part scores of the CE showed validities ranging from .49 to
.59; and the total score, a coefficient of .70. (4) The GRE total
and part scores exhibited negligible to slight relationships with the
pass-fail criterion as evidenced by coefficients varying between -.13
and .27. (5) Sex membership revealed negligible correlation coefficients
with CE scores as well as with GRE scores and a coefficient
of only .01 with the pass-fail criterion.

Conclusions. For a sample of 844 graduate students seeking
admission over slightly more than a two-year period as candidates
for the doctorate in the USC School of Education, part
and total scores on the GRE were negligibly to modestly related
to performance on five objective achievement tests comprising
the Comprehensive Examination (CE). Admission to candidacy,
based on a pass-fail decision of a doctoral committee, was substantially
dependent on total scores on the CE, slightly but perceptibly
related to GRE scores, and virtually unrelated to the
sex of the candidate.

APPROPRIATENESS OF SUBTESTS IN ACHIEVEMENT
TESTS SELECTION


THOMAS M. GOOLSBY, JR. 
University of Georgia 


AUTHORS and publishers of extensive achievement test batteries
have usually presented validity evidence of “content” and/or “curricular”
types. This validity evidence has been quite appropriate,
since a criterion considered to be better than the battery and/or
its subtests would be extremely difficult to identify or define.

Considerations of relationships between total scores and interrelationships
among subtests for any two achievement batteries
should be given particular attention by those responsible for
achievement test selection. For extensive achievement batteries,
correlations of total test scores on any two will approach .90 and
many times exceed that value. It is possible for the zero order
relationship of total scores on two batteries to approach unity
(1.00) when one considers the very high reliabilities of batteries
approximating .95. On the basis of the .90 magnitude of relationship
for total scores, one might decide that one battery is just as
appropriate to use with a given population as another. This decision
is probably correct if one is not particularly interested in
determining subject area emphases for follow-up activities or curriculum
guidance values of different achievement test batteries.
Users of achievement test batteries are generally more interested
in follow-up activities and in curriculum guidance features of
achievement tests than in a simple rank ordering or placement
of students on the basis of a total score. The present study presents
evidence of internal uniqueness of two achievement test
batteries important to diagnosis and guidance.
The present study was designed to

1. determine the degree to which like named components (science,
mathematics and so forth) of two achievement test
batteries measure the same things.

2. ascertain the correlation of total test scores for the two
achievement test batteries.

3. derive some information regarding the rational appropriateness
of achievement test batteries for a given population.

Procedures. The two achievement test batteries considered were
the Metropolitan Achievement Tests (MAT) and Stanford
Achievement Tests (SAT) designed for measurement of outcomes
of instruction at the same grade level. Selected subtests which are
named exactly the same in both batteries and others which are
similarly named were considered.

The population used was a large junior high school in a suburban
area of greater metropolitan Miami. The MAT and SAT were
administered to the seventh grade students in the fall. Intercorrelations
among and between certain of the measures and between
total scores for the two batteries were obtained. Means and standard
deviations were also computed.


Results and discussion. Total scores on the MAT and SAT correlated
.89. At this level of relationship, it can be said that the
batteries measure essentially the same characteristics. This outcome
was expected. Even though these two batteries apparently
measure similar components, the evidence that follows shows
marked internal and practical differences important to achievement
testing.

Table 1 presents relationships between certain subtests of MAT
and SAT. The relationships corrected for attenuation between
similarly named subtests in MAT with those in SAT had a range
of .49 to .69. The differently named subtests had a range of .39 to
.57. These ranges are, in general, typical of interrelationships
corrected for attenuation for subtests in a single achievement
battery. Subtests
for which this range is not typical are subtests in a battery for
special purposes such as vocabulary or special subject area coverage
known to be highly related to other components but deemed
important to include for special curricular emphases.
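
The correction for attenuation mentioned above divides an observed correlation by the square root of the product of the two reliabilities. The sketch below assumes that standard formula; the function name is hypothetical, and the example combines the total-score correlation of .89 with the reliability of about .95 mentioned earlier purely as an illustration, not as a result reported by the study.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimate the correlation between true scores.

    r_xy is the observed correlation; r_xx and r_yy are the reliability
    coefficients of the two measures.
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Illustration only: observed r = .89 with both reliabilities taken as .95.
print(round(correct_for_attenuation(0.89, 0.95, 0.95), 2))
```

The correction always raises the coefficient when either reliability is below 1.00, which is why the parenthesized entries in Table 1 exceed the corresponding zero order correlations.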

The relationships in Table 1 for similarly named subtests in the


TABLE 1
Relationships between Certain Subtests of MAT and SAT (N = 335)

Variable 1 2 3 4 5
K м) i 0 ш wm 
58) (51) 
ZEct — 3à)—w 
9 
: (49) (58) ri Hi - 
(44) (85) 
L “4 47 EJ E 
(44) (50) (51) (0) 0) 
10 49 55 57 ^ LJ 
(56) (64) (70) 67) - 
E e o o 


* Decimals omitted.
1 SAT-Arithmetic Computation.
2 SAT-Arithmetic Concepts.
3 SAT-Arithmetic Application.
4 SAT-Science.
5 SAT-Social Studies.
6 MAT-Arithmetic Computation.
7 MAT-Arithmetic Problem Solving.
8 MAT-Science.
9 MAT-Social Studies Information.
10 MAT-Social Studies Skills.
Zero order correlations.
Zero order correlations corrected for attenuation are inserted within parentheses.


two batteries are probably somewhat surprising to many of those
who frequently deal with practical measurement problems such as
achievement battery selection to measure outcomes of instruction,
pupil progress, and uniqueness of a given curriculum. The evidence
presented here is supportive of the fact of uniqueness of curriculum


TABLE 2
Means and Standard Deviations for Certain MAT and SAT Subtests
(N = 335)
Tut, Descriptive x dedi E NS sp 
1hmetie Computat 41 . .36 
Arithmetic GÊ e 40 82 "- i" 
“Arithmetic Application BITE BR Mua 
Т-Зос Studies 92 -89 С 8.88 
AT-Science 60 8s 25.06 804 
MAT Arithmetic Computation 45 E) си 8. 1 
MAT ga hmetie Problem Solving 2 = 27.08 9.85 
108 : 9.91 
MAT Social Studies Information oO NET en 
{Number of items, 


split test reliability coefficient estimates reported by the


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


TABLE 3


Multiple Correlations Involving Certain MAT and SAT Subtests


areas as measured by two achievement batteries designed to measure
outcomes at the same grade level. The evidence could be interpreted
as being supportive of the need for a variety of batteries
constructed to fit different curriculums by subtests in both content
and/or method.

Table 2 presents the means and standard deviations
for certain of the MAT and SAT subtests. In general, the MAT is
more appropriate for the population of this study. The SAT subtests
presented are, in general, somewhat too difficult. The reader
should not, however, depend too much on these values. The MAT
was selected at an earlier date for the population of this study by
analyzing the curriculum by areas, considering future [...]
innovations, and administering subtests of various batteries to
appropriate instructional personnel responsible for instruction in
a given subject area.

Table 3 presents multiple correlations of certain MAT and SAT
subtests. These multiple correlations are not substantially different
from the zero order relationships presented in Table 1.

THE RELIABILITY AND VALIDITY OF THE
CONTEMPORARY MATHEMATICS TEST


JOSEPH S. RENZULLI 
AND 
ROBERT A. SHAW
University of Connecticut 


THE Contemporary Mathematics Test (CMT) is a relatively
new instrument that is designed to measure the extent to which
students have mastered course content in modern mathematics.
The test series, which consists of two forms at five levels (Lower
Elementary, Upper Elementary, Junior High, Senior High, and
Algebra), yields total raw scores which can be converted to percentile
ranks, standard scores, or stanines through the use of
normative tables. According to the Manual, the test series measures
results common to programs in contemporary mathematics.
Emphasis is placed on understanding concepts and skills pertinent
to solving problems in the areas of structure and number as well as
on special mathematical devices such as number lines and the
coordinate system, formulas, and special symbols (California Test
Bureau, 1966a). The content categories for all levels except algebra
are defined as properties of numbers, mathematical [...],
systems of enumeration, nature and structure of proof, ratio and
proportion, mathematical sentences, geometry, and other topics
including variables, functions, and graphs (California Test Bureau,
1966).
Information pertaining to the reliability and validity of the
CMT is limited to data gathered on the norming sample by the
publishers. Reliability data of the usual type are reported
by form and grade level. Internal-consistency reliability coefficients
computed by use of the Kuder-Richardson formulas 20 and
21 range from .70 to .89; short- and long-range stability data
computed by use of the Pearson product-moment correlation
coefficients (corrected for range) range from .60 to .88. Congruent
validity data comparing the CMT (Junior High Level) with the
California Short-Form Test of Mental Maturity (Junior High
Level, 1957 edition) yielded product-moment correlations ranging
between .50 and .62. Comparisons with the California Achievement
Test (CAT) (Junior High Level, 1957 edition) yielded
correlations ranging from .43 between CMT and CAT Spelling to
.70 between CMT and CAT Arithmetic Reasoning.
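
For readers unfamiliar with the Kuder-Richardson coefficients mentioned above, the following sketch computes formulas 20 and 21 from a matrix of right-wrong item scores. It assumes the usual textbook definitions with the population variance of total scores; the function names and the tiny response matrix are hypothetical and have no connection to the CMT norming data.

```python
def _total_score_stats(responses):
    """Mean and population variance of examinees' total scores."""
    totals = [sum(row) for row in responses]
    n = len(totals)
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return mean, variance


def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores (rows are examinees)."""
    n, k = len(responses), len(responses[0])
    _, variance = _total_score_stats(responses)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # proportion passing item j
        sum_pq += p * (1.0 - p)
    return k / (k - 1) * (1.0 - sum_pq / variance)


def kr21(responses):
    """Kuder-Richardson formula 21, which treats all items as equally difficult."""
    k = len(responses[0])
    mean, variance = _total_score_stats(responses)
    return k / (k - 1) * (1.0 - mean * (k - mean) / (k * variance))


# Hypothetical responses of four examinees to three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(data), 2), round(kr21(data), 2))
```

Formula 21 needs only the mean, the variance, and the number of items, and it never exceeds formula 20; the two agree when all items are equally difficult.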
Purpose. The purpose of this study was to obtain additional
empirical information relating to the reliability and validity of
the Contemporary Mathematics Test, Junior High Level. Although
two reviewers (Smith, 1967; Romberg, 1968) have expressed
concern about the lack of empirical support for this instrument,
a review of the literature indicates that apparently no independent
research studies relating to reliability and validity have been carried
out to date.
Procedure. A sample of 232 students in grades seven, eight, and
nine who were enrolled in a modern mathematics program for a
minimum of three years served as subjects for the present study.
Because of incomplete comparative data, relationships between
the CMT and other selected variables are based on smaller samples.
As a group, the subjects were somewhat above average in
general mental ability (Mean IQ = 112); however, the entire
spectrum of ability levels was represented in the sample population
with the exception of students enrolled in special classes for
the mentally retarded.
Reliability data consisted of pre- and post-test comparisons between
the two forms of the CMT administered at the beginning
and at the end of the school year and of estimates derived from
the Kuder-Richardson 20 formula. Validity data consisted of
comparisons between each of the two forms of the CMT and scores
on (1) a comprehensive final examination that was based on the
year's work in mathematics, (2) final grades in mathematics, (3)
the Arithmetic Reasoning and Arithmetic Fundamentals subtests of
the California Achievement Test, (4) the mathematics portion of
the Sequential Test of Educational Progress, and (5) the
Thorndike Intelligence Test.




Results and discussion. The summary statistics for the above
comparisons appear in Tables 1 and 2. As can be seen in Table 1,
correlations based on retesting with alternate forms of the CMT
ranged from a low of .75 to a high of .82. Correlations derived
from the Kuder-Richardson formula 20 were slightly higher, 
ranging between .78 and .88. Thus, the reliability of two forms 
of the CMT over a period of approximately one school year ap- 
peared to be quite favorable, and reliability estimates based on 
individual item statistics indicated that the two forms of the 
CMT possess a relatively high degree of internal consistency. 

Data pertaining to the validity of the CMT are presented in 
Table 2. Both forms of the test seem to be more closely related 
to STEP scores, IQ scores, course grades, and final examination 
grades than to subscores on the Arithmetic Subtests of the CAT. 
It should be noted that the CAT Arithmetic portion showed a 
lower relationship with these variables than did the CMT. These 
findings may reflect the current emphasis on teaching mathematical
concepts and the focus of the CAT upon traditional skills.
The relatively low correlations between the CMT on the one hand
and final examinations and course grades on the other suggested
that the test is largely independent of the content to which the
subjects in the study were exposed. This outcome may be a result
of the scope of the test, the emphasis given by teachers to the 
understanding of modern mathematical concepts, or an interac- 
tion between these two factors. 


TABLE 1
Standard Deviations and Correlation Coefficients for 
of the CMT 


N Administration Mean 


5. 7.10 E! 

б 7.94 B 

; 6.91 .80 

x 9.17 .88 

à 5.76 -78 

К 6.24 м 

{ 7.45 .81 

Post 20.98 8.71 57 


All coefficients are significant at the .001 level.




TABLE 2 
Inter-Correlation among CMT (Forms W and X) and Seven Selected Variables 


) CMT-W 
N 

) CMT-X 
N 

) Dee 

|) CAT-AR* 
N 

5) CAT-AF® 
N 

8) HOM 

7) Final Examination 
N 

8) Course Grades 
N 

9) STEP 


* Indicates significance at .05 level. 
** Indicates significance at .01 level. 
a California Achievement Test-Arithmetic Reasoning.

b California Achievement Test-Arithmetic Fundamentals.

DIFFERENTIAL EFFECTS OF INITIAL COURSE
PLACEMENT AS A FUNCTION OF ACT MATHEMATICS
SCORES AND HIGH-SCHOOL RANK-IN-CLASS IN
PREDICTING GENERAL PERFORMANCE
IN CHEMISTRY


JOHN R. REINER 
Southern Illinois University at Edwardsville 


JUSTIFICATION for using standardized tests as a device for placing
new college students at suitable beginning levels of instruction
generally has centered on the possible effects of poor placement on
motivation and performance, or loss of time in attaining educational
goals (Dunn, 1966).

Although such placement has become commonplace, validity
studies ordinarily examine the individual course in question, and
the criterion consequently becomes a measure which reflects the
achievement of only one term. An example of this method was a
recent validity study of the American College Testing Program's
Mathematics Placement Examination (Shevel and Whiteney, 1969).

The present study examined the possible long-term effects of initial
placement. The specific question addressed was, “What effect
does initial placement of students in the beginning course of a discipline
have on subsequent performance in the discipline?”

Methodology. Subjects were 250 students who, as entering freshmen,
had been placed in the first of a three-course beginning chemistry
sequence (normally considered the appropriate beginning
course for science majors) and 88 students who had been placed in
the second course of the same sequence on the basis of a
locally-developed examination. In the course of a validity study of this
procedure, it was discovered that the locally-derived instrument
was contributing very little to the proper placement of students
(R² = 0.13). Consequently, the same subjects used in the validity
study were employed to develop a "best" prediction model for
future placement, which resulted in a two-variable equation using
the American College Testing Program Standard Mathematics
Score (ACTM) and high school rank-in-class percentile (RIC).

At the end of the Spring Quarter 1970, after the originally-placed
students had been in attendance for two academic years,
their grades in six “key” chemistry courses were recorded, if available,
and used in the present study.

Statistical procedures followed a multiple-regression technique to
obtain F ratios by which to judge the significance of the extent to
which predictor variables account for the variability of a criterion
(Kelly, Beggs, and McNeil, 1969). In this study, the criterion was
cumulative grade-point-average (GPA) in the six key chemistry
courses, and predictors were continuous scores ACTM and RIC and
two categorical predictors (PL1 and PL2) representing “first course
placement” or “second course placement.” The formal hypothesis
tested was: “The model representing effects of PL across all levels of
RIC, across all levels of ACTM, does not add significantly to the
predictability of GPA by a model representing the linear effects of
the three variables alone.” The significance of the effects of restricting
the "interaction" model efficiency (R²) was tested by examining
the obtained F in terms of the probability level (alpha) established
for this study, 0.05.
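
The comparison of a full and a restricted model described above is conventionally carried out with the F ratio for the difference between two R² values. The sketch below assumes that standard form; the function name is hypothetical. The example call uses the rounded values reported in Table 1 (R² of 0.33 and 0.28, with 2 and 332 degrees of freedom). It gives an F of about 12.4 rather than the reported 12.98, a gap that may simply reflect the rounding of the published R² values.

```python
def f_ratio(r2_full: float, r2_restricted: float, df1: int, df2: int) -> float:
    """F ratio for the loss in R-squared when a full model is restricted.

    df1 is the difference in the number of linearly independent vectors
    between the two models; df2 is the number of subjects minus the number
    of linearly independent vectors in the full model.
    """
    if not 0.0 <= r2_restricted <= r2_full < 1.0:
        raise ValueError("expected 0 <= R2 restricted <= R2 full < 1")
    if df1 <= 0 or df2 <= 0:
        raise ValueError("degrees of freedom must be positive")
    return ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)


# Rounded values as reported in Table 1 of this study.
f_value = f_ratio(0.33, 0.28, 2, 332)
print(round(f_value, 2))
```

A large F indicates that the interaction terms dropped from the full model carried predictive information that the restricted model cannot recover.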

Analysis of results. Data used to test the hypothesis are shown in 
Table 1. On the basis of these data, the formal hypothesis was rejected 
and an alternate hypothesis was accepted as tenable: The model 
representing effects of PL across all levels of RIC, across all levels 
of ACTM, adds significantly to the predictability of GPA by a
model representing the linear effects of the three variables alone. To
state these results in another way, initial level of placement has differential
effects on GPA at different levels of RIC across different
levels of ACTM.


TABLE 1
Test Data for Determining Effectiveness of Differential Predictor
Interaction in GPA Prediction

df1    df2    R²f    R²r    F        P
2      332    0.33   0.28   12.98    <0.001


Note. df1 and df2 are degrees of freedom for the “full” and “restricted” models, respectively.
The subscripts "f" and “r” with R² have similar meaning. df1 = (m1 - m2), where m1 is the number
of linearly independent vectors in the full model and m2 is the number of linearly independent
vectors in the restricted model. df2 = (N - m1), where N is the number of subjects and m1 is defined
as above.

The general prediction equation developed to reflect the model
tested was as follows:

Y_1 = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_5 + b_6 X_6
      + b_7 X_7 + b_8 X_8, where:

Y_1 = Criterion = GPA
X_1 = Exclusive group membership (PL_1)
X_2 = Exclusive group membership (PL_2)
X_3 = (RIC) (PL_1)
X_4 = (RIC) (PL_2)
X_5 = (ACTM) (PL_1)
X_6 = (ACTM) (PL_2)
X_7 = (RIC) (ACTM) (PL_1)
X_8 = (RIC) (ACTM) (PL_2)
a = constant or regression weight associated with the unit vector.
b_1 through b_8 = weights associated with X_1 through X_8 and
                  calculated to maximize the variance accounted for
                  by the model.

Using the calculated values the general equation becomes,

Y_1 = 0.7318 + 2.2929 X_1 + 1.0300 X_2 - 0.0194 X_3 - 0.0290 X_4
      - 0.0469 X_5 - 0.0146 X_6 + 0.0014 X_7 + 0.0020 X_8

Then two equations can be developed using the predictors:

Placement in first course:
Y_1 = 0.7318 + 2.2929(1) + 1.0300(0) - 0.0194 (RIC)
      - 0.0469 (ACTM) + 0.0014 (ACTM) (RIC)

Placement in second course:
Y_1 = 0.7318 + 2.2929(0) + 1.0300(1) - 0.0290 (RIC)
      - 0.0146 (ACTM) + 0.0020 (ACTM) (RIC)
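
The first-course equation above can be evaluated directly. The sketch below is only an illustration: the function name is hypothetical, the weights are the four-decimal values printed in the text, and the check is limited to first-course placement at ACTM = 26, where the printed weights agree with the Table 2 entries (2.49, 2.83, 3.17) to within rounding.

```python
def predicted_gpa_first_course(ric: float, actm: float) -> float:
    """Predicted GPA for a student placed in the first course.

    Uses the printed weights with PL_1 = 1 and PL_2 = 0, so the
    group-membership terms reduce to a single intercept.
    """
    intercept = 0.7318 + 2.2929 * 1 + 1.0300 * 0
    return intercept - 0.0194 * ric - 0.0469 * actm + 0.0014 * actm * ric


# Predicted GPA at ACTM = 26 for three representative RIC levels.
for ric in (40, 60, 80):
    print(ric, round(predicted_gpa_first_course(ric, actm=26), 2))
```

Because the interaction weight is printed to only four decimals and is multiplied by products of RIC and ACTM in the thousands, small rounding differences in the weights can shift a predicted GPA by a few hundredths.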


Table 2 shows the predicted grade-point-average for students
according to the course into which they were initially placed, at
representative levels of ACTM-RIC combinations. The data in Table 2
are not absolutely complete, for they are only meant to illustrate
a continuous, three-variable interaction. It is clear, however, that
initial placement has a differential effect, at different levels of ACTM
and RIC, on subsequent performance. Comparing the predicted
performance of the two groups in the courses in question, it is apparent
that students initially placed in the first course would be expected to
attain higher GPA's if their RIC's were in the middle or lower ranges
regardless of ACTM. As ACTM becomes greater, of course, the
predicted scores for the two groups become more and more nearly similar.
It is only in the highest ranges of RIC distribution that students
initially placed in the second course would be expected to do as well or
surpass those initially placed in the first course.


TABLE 2

Predicted Grade-Point-Averages at Representative Levels of ACTM-RIC Combination
(A = 5, B = 4, C = 3, D = 2, E = 1)

                                              RIC
ACTM                              ------------------------------
Level    Placement                40      60      80      99

24       1st Course Placement     2.47    2.75    3.04    3.30
         2nd Course Placement     1.52    1.90    2.28    2.64
26       1st Course Placement     2.49    2.83    3.17    3.49
         2nd Course Placement     1.60    2.06    2.53    2.96
28       1st Course Placement     2.50    2.90    3.30    3.67
         2nd Course Placement     1.68    2.22    2.76    3.27
30       1st Course Placement     2.52    2.97    3.43    3.86
         2nd Course Placement     1.75    2.37    3.00    3.58
32       1st Course Placement     2.54    3.05    3.56    4.04
         2nd Course Placement     1.83    2.53    3.24    3.90
34       1st Course Placement     2.56    3.12    3.69    4.22
         2nd Course Placement     1.91    2.69    3.47    4.21

Implications. This study was concerned only with a single disci- 
pline at a single institution, and further information about the phe- 
nomenon investigated should be gathered by replication at different 
institutions, and certainly by applying the basic procedures used to 
data from different disciplines. However, the findings do support 
at least this important point: the uncritical acceptance of placement 
by examination as a device to benefit students may not be an
unmitigated good. Although the grade-point-average might not provide
a very satisfactory educational criterion, it is widely used as a
selective criterion by graduate schools and prospective employers. Thus,
any procedure which by virtue of only theoretically valuable
educational devices places students in a less competitive graduate school
or employment position, deserves serious and continuing study.

From a practical viewpoint, the use of an equation of the com- 
plexity developed in this study is quite possible with a reasonably 
sophisticated computer installation. Regardless of specific methods, 
it is important to note that in order to benefit students, placement in 
advanced courses should be based only on relatively high scores on 
relevant predictor variables to avoid the problems discussed. It is 
possible that the positive aspects of motivation and savings of time 
associated with advanced placement are outweighed by decrements 
in subsequent performance. In any event, extensive cross-validation 
efforts would be required before the results of this investigation 
could be generalized to this population or to other populations. 


REFERENCES

Dunn, J. E. A study of the University of Arkansas Mathematics
  Entrance Exam as a placement device. Journal of Experimental
  Education, 1966, 34, 62-68.
Kelly, F. J., Beggs, D. L., and McNeil, K. A. Research design in
  the behavioral sciences: Multiple regression approach.
  Carbondale and Edwardsville, Illinois: Southern Illinois University
  Press, 1969.
Shevel, L. R. and Whitney, D. R. Predictive validity of the
  Mathematics Placement Examination. EDUCATIONAL AND
  PSYCHOLOGICAL MEASUREMENT, 1969, 29,


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1971, 31, 983-987. 


THE CRITERION-RELATED VALIDITIES OF COGNITIVE 
AND NONCOGNITIVE PREDICTORS IN A TRAINING 
PROGRAM FOR NURSING CANDIDATES 


WILLIAM B. MICHAEL, RUSSELL HANEY, AND YOUNG B. LEE
University of Southern California 
AND 
JOAN J. MICHAEL 
California State College, Long Beach 


It was the three-fold purpose of this investigation (1) to cite the
validity coefficients of seven standardized cognitive test measures,
four indices of high school achievement, and two scales from
each of two self-report inventories in the prediction of grades in each
of eight courses in a program of nursing education for the 1969-1970
period taken by a total sample of 128 students at the Los Angeles
County Hospital, (2) to report validity coefficients with respect to
the same combinations of predictor and criterion variables just
mentioned for a sample of 96 candidates who survived the first part of
the program and continued during the second segment, and (3) to
indicate for this sample of 96 successful candidates the validity
coefficients of the same predictor variable with respect to each of eight
additional criterion measures representing other course work in the
nursing program. For the first two purposes the findings are cited
in Table 1, and for the third purpose the corresponding data are
furnished in Table 2. In addition, the intercorrelations within each
of the two sets of criterion measures are presented in these tables.
Additional information regarding many of the measures employed 
as well as findings with prior samples may be found in the article 
by Michael, Haney, and Jones (1966) and in the previous articles 
cited in its bibliography. 

Findings and interpretation. The findings may be summarized
and evaluated as follows:


1. The most valid predictor of success in the nursing program was
the California Reading Test—Comprehension, although the
correlations for the most part were modest or low.

2. The second and third most valid predictors were given by grade
point average earned in high school academic subjects (variable
nine) and overall high school grade point average (variable
eight), although in selected subjects the California Mathematics
Test—Reasoning was almost as valid as these two other
predictors.

3. With the possible exception of the low validities for the two
criterion variables of Physiology and Nursing 2, the measure
of spatial visualization (variable seven) was not particularly
predictive of success in any course within the program.

4. The two self-report inventories—the 16PF and the MMPI—
yielded correlations of virtually no predictive value, as reflected
by the fact that among 240 possible validity coefficients only
12 were significant at the .05 level for the total sample and
only 24 were significant at the .05 level for the successful sample
of 96 candidates. Only two scales from each of the two
self-report inventories yielded two or more validity coefficients
significant at the .05 level for the sample of successful students.
These scales are the ones cited in Tables 1 and 2.

5. Evidence from Tables 1 and 2 points to the relatively low degree
of intercorrelation among the criterion measures for both the
total sample and the group of successful candidates. Although
these coefficients might indicate that different characteristics
were being evaluated in the several courses, their relatively low
magnitude in relation to observations made by the writers in
other school-oriented contexts would suggest that the reliability
of the grading process might be open to question.

6. Among the intercorrelations of the criterion measures (as
reported in Table 2) it should be noted that the correlations of
the ward adjustment measures are near zero or negative to a
slight degree, although in seven instances, significantly, with
the measures in the other courses.


Conclusions. It is apparent from the relatively low magnitude of
the correlations reported that, despite the probable presence of some
restriction in range of talent, consideration needs to be given not only
to the introduction of other measures for the selection and placement
of students in the nursing program but also to possible modifications
in supervisory and in-service activities directed toward the
improvement of the evaluation process. Although reliability data are not
present for the criterion measures with the single exception of the


A NOTE ON THE VALIDITY OF TWO MEASURES
OF HIGH SCHOOL RANK 


GERALD W. McLAUGHLIN 
United States Military Academy 
West Point, New York 


At present, two main ways exist for a college to utilize an
applicant’s high school academic performance. The first of these is the
use of grades in high school, either an average or a weighted average
of selected grades. In a recent report, the use of four weighted grades
in multiple regression analyses resulted in median validities of .44
to .55 for a sample of 398 colleges (Munday, 1967).

The second method of incorporating this information is to use a
measure of high school rank (HSR). The rank is usually expressed
as a form of (1 - R/N) × 100, where R is the applicant’s rank in
his graduating class and N is the size of the graduating class. It
also has validity for predicting college academic performance
(Borgatta and Bohrnstedt, 1969; Lavin, 1965).

Purpose. Both of these measures customarily improve the validity 
of predicted freshman grades. However, the use of HSR seems to 
be intuitively more appealing because of its simplicity. The purpose 
of this paper was to propose modification in the measurement of a 
college applicant’s graduating rank in his high school class. 

Procedure. The two measures of HSR used in this study were
(a) the standard predictor (HSR1) computed as (1 - R/N) × 100,
and (b) a modified form (HSR2) computed as (1 - R/CN) × 100
where CN is the number of those in the applicant’s high school class
planning to attend college. The size of the college bound segment was
furnished by an official at the applicant's school. The sample
consisted of 194 cadets entering the United States Military Academy
in 1968 on whom complete data were available.
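
The two rank indices can be written out as a short sketch. The function names and the example figures (rank 25 in a class of 100, of whom 50 plan to attend college) are hypothetical and serve only to show how the change of denominator affects the score.

```python
def hsr1(rank: int, class_size: int) -> float:
    """Standard high school rank score: (1 - R/N) x 100."""
    return (1 - rank / class_size) * 100


def hsr2(rank: int, college_bound: int) -> float:
    """Modified rank score: (1 - R/CN) x 100, where CN is the number
    of classmates planning to attend college."""
    return (1 - rank / college_bound) * 100


# Hypothetical applicant: rank 25 of 100, with 50 college-bound classmates.
print(hsr1(25, 100))  # 75.0
print(hsr2(25, 50))   # 50.0
```

When everyone in the class plans to attend college, CN equals N and the two scores coincide; as the college-bound segment shrinks, HSR2 falls below HSR1 for the same rank.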


TABLE 1
Descriptive Statistics for Two Measures of High School Rank Scores

Rank
Index    Mean     SD       SAT-V    SAT-M    ENG-C    MATH-A    GPA     HSR1
HSR1     84.79    13.74    .182     .157     .183     .234      .440    --
HSR2     72.75    21.50    .237     .204     .252     .294      .549    .70

* The symbols HSR1 and HSR2 are described in the text.


The major criterion was a student's overall academic grade point 
average (GPA) at the end of his freshman year at the Academy. The 
degree of relationship between these rankings and the performance 
on each of four standardized tests of the College Entrance
Examination Board (CEEB) was determined.

Findings. The results are shown in Table 1. The modification of
the denominator in the High School Rank Score significantly
improved the validity of the score for this sample (p < .01). This
correction in rank is a logical one, since the HSR score is a measure of
an applicant’s performance relative to his peer group. If he plans to 
attend a college, a first approximation of his academic peer group 
size is the number of those in his graduating class also planning to 
attend college. This correction is also relative to the quality of high 
school, since the score of an applicant from a school sending every- 
one to college would be greater than the score of an individual with 
the same rank in an equivalent size graduating class where fewer 
graduates went to college. 

Conclusion. The relationships of the measures with the CEEB 
tests suggest that HSR can provide a unique contribution to
predicting an individual’s academic grade point average.


MULTIVARIATE VALIDITY OF THE OTIS-LENNON
MENTAL ABILITY TESTS PRIMARY I LEVEL 


BRAD S. CHISSOM AND JERRY R. THOMAS
Georgia Southern College 


Measures of academic achievement in the form of teacher ratings
are often used as the criterion measure in validity studies. One of 
the limitations of teacher ratings is that they ignore the specificity 
of learning in the several areas comprising total achievement. A 
more complex rating based on specific areas of achievement would 
seem to offer greater possibilities for effective measurements. This 
study employed a complex teacher rating scale of kindergarten 
children as a criterion measure for validating the Otis-Lennon 
Mental Ability Test Primary I Level (1967). A validity co- 
efficient was obtained through the use of canonical correlation by 
correlating the parts of the Otis-Lennon with the parts of the teacher 
ratings, a procedure suggested by Mukherjee (1966).

Studies employing teacher ratings of kindergarten children as
measures of achievement have been conducted by Koppman and
LaPray (1969) and Meyers, Attwell, and Orpet (1968) with some
degree of success. For kindergarten children, a single teacher rating
had only a moderate correlation with a single objective measure of
academic aptitude.

Method—Subjects. The children employed as subjects in this
study consisted of two classes of 20 kindergarten children (N = 40)
from the Georgia Southern College Laboratory School. The mean
age for the group was 67.48 months at the time of test
administration (March, 1971).

Instruments. Two instruments were employed in this study. The
first was a complex teacher rating scale in which the two
kindergarten teachers were asked to assign a numerical rating from one
to nine for each child in four academic areas: (1) Reading Readiness,
(2) Quantitative, (3) Verbal, and (4) Listening. The second measure
was the Otis-Lennon Mental Ability Test Primary I. The
Otis-Lennon MAT consists of fifty-five items divided into two parts.
Part I requires the subject to identify the different picture in
a group of four pictures, and Part II directs the subject to choose
the correct picture corresponding to a verbal description. Both parts
of the test are administered orally to the subject.

Results. Reliability for the total score on the teacher rating
measure, which was calculated by Cronbach's Alpha, was estimated to
be .92. The Otis-Lennon MAT reliability coefficient for total score,
which was computed using the split-half odd-even method and
increased by the Spearman-Brown Formula, was equal to .91.
The resulting canonical correlation between the two sets of
variables was .76, significant at less than the .001 level.
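
The two reliability estimates named above follow standard formulas, sketched here under stated assumptions: the function names and the toy data are hypothetical, sample variances are used throughout, and nothing below reproduces the study's actual item data.

```python
from statistics import variance


def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)


def cronbach_alpha(parts: list[list[float]]) -> float:
    """Cronbach's Alpha; parts[i][j] is the score of person j on part i."""
    k = len(parts)
    totals = [sum(person) for person in zip(*parts)]
    part_variance = sum(variance(part) for part in parts)
    return k / (k - 1) * (1 - part_variance / variance(totals))


# A half-test correlation near .835 steps up to roughly .91.
print(round(spearman_brown(0.835), 2))

# Toy ratings for three children on two perfectly parallel parts.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
```

The Spearman-Brown step assumes the odd and even halves are equivalent; when they are not, the stepped-up value can overstate the reliability of the total score.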

Examination of Table 1, which contains means, standard
deviations, and beta weights, indicates differential weightings for the
teacher ratings. Reading readiness and quantitative ability received
the heaviest weights, while listening ability is weighted negatively.
The two parts of the Otis-Lennon are weighted approximately equally.

Summary. This study has demonstrated that teacher ratings
appear to be more useful when they are composed of several parts.
Further, canonical correlation analysis is a feasible technique for
assessing validity when both measures incorporate part scores or
subtests as frequently found in academic and intellectual measures.


TABLE 1
Means, Standard Deviations, Beta Weights, Reliabilities, and Canonical Correlation

                                                  Z-Score
                                     Standard     Beta
Variable                   Means     Deviations   Weights

Teacher ratings
  Reading                   5.58       1.62         .871
  Quantitative              5.93       1.53         .524
  Verbal                    6.13       1.22         .226
  Listening                 6.18       1.38        -.280
Otis-Lennon MAT
  Part I                   13.45       4.58         .749
  Part II                  14.97       5.05        [illegible]

Rc = .76, χ² = 38.86 (df = 8), p < .001

It should be emphasized, however, that cross-validation efforts with
new samples would be necessary before these results can be
generalized.


THE CONCURRENT VALIDITY OF THE SPRIGLE
SCHOOL SCREENING READINESS TEST FOR A 
SAMPLE OF PRESCHOOL AND KINDERGARTEN 
CHILDREN 


MARIA S. A. SEDA
Perris Union High School District, California 


JOAN J. MICHAEL 
California State College, Long Beach 


Problem. It was the primary purpose of this study to determine 
the degree of relationship between the Sprigle School Readiness 
Screening Test (Sprigle) and the Metropolitan Readiness Test 
(MRT) with the view to substituting the Sprigle for the
Metropolitan. Secondarily, it was the purpose to investigate the
relationship between scores on the Peabody Picture Vocabulary Test
(both Peabody IQ score and Peabody raw score) and the MRT. 

Procedure and subjects. One hundred children (25 preschool and
75 kindergarten in suburban Southern California), ranging in age
from four years and 10 months to six years and nine months, were
given the three tests. Starting in June 1970 and continuing through
January 1971, the MRT (Form A), the Sprigle, and the Peabody
(Form A) were administered in random order according to the
manuals of directions. Regardless of the order, there was a
week's separation between the administration of the MRT and the
random administration of the Sprigle and the Peabody in light of
the time involved in giving the Metropolitan and the limited
attention span of this age child.

Results. As shown in Table 1, there was a correlation of .73 (p
< .01) between the Sprigle and the MRT. Further, the correlation
between the Peabody IQ scores and the MRT scores was .58
(p < .01); whereas the correlation between the Peabody raw scores
and the MRT scores was .45 (p < .01).

It should also be noted that when a multiple R was computed
between the composite made up of the Sprigle and Peabody IQ
scores on the one hand and scores on the MRT on the other, a
correlation of .73 was found. Similarly, when Peabody raw scores were
substituted for the IQ scores, a multiple R of .75 was found. In
neither case was there a statistically significant increment in R
with the addition of the Peabody.


TABLE 1

Correlation Matrix for the Sprigle School Screening Readiness Test, the Peabody
Picture Vocabulary Test, and the Metropolitan Readiness Test

                      Peabody       Peabody
Variable              IQ scores     raw scores

Note.—All correlation coefficients were significant beyond the .01 level.

Discussion of Results. It might be of incidental interest to note
that when the correlation of .73 between the Sprigle and the MRT
was compared with the correlation of .58 between the Peabody
raw score and the MRT, the Hotelling’s t value (Guilford, 1969,
p. 190) was 2.53 (p < .05). Further, when the correlation of .73
between the Sprigle and the MRT was compared with the
correlation of .45 between the Peabody IQ scores and the MRT, the
t value was 4.27 (p < .01).
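
Hotelling's t for comparing two correlations that share one variable (here the MRT) can be sketched as below. The formula is the standard one with N - 3 degrees of freedom; the function name is hypothetical, and the Sprigle-Peabody correlation of .60 is an assumed value chosen purely for illustration, since that coefficient is not given in the text.

```python
import math


def hotelling_t(r12: float, r13: float, r23: float, n: int) -> float:
    """Hotelling's t (df = n - 3) for the difference between two dependent
    correlations r12 and r13 that share variable 1.

    r23 is the correlation between the two predictors themselves.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))


# r12 = Sprigle with MRT, r13 = Peabody with MRT, N = 100 children.
# r23 = .60 is a hypothetical Sprigle-Peabody correlation.
t_value = hotelling_t(r12=0.73, r13=0.58, r23=0.60, n=100)
print(round(t_value, 2))
```

The size of t depends noticeably on r23, so the value obtained with an assumed r23 should not be read as a reproduction of the reported statistic.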

In light of the high degree of correlation between the Sprigle
and the widely used MRT (.73) and the fact that the addition of
neither the Peabody IQ score nor raw score produced any significant
increment in R, there seems to be evidence of concurrent validity
of the Sprigle by itself as an effective screening device. This
conclusion is further supported by the finding that the Sprigle can
demonstrate greater potential predictive validity than can the
Peabody (either raw score or IQ score) when the criterion variable
is raw score on the MRT.

Since the general long term goal of this study was to increase 
success and decrease failure in primary grades by identifying a
test which would effectively predict school readiness, it was of in- 
terest to find not only an instrument which would be quick to give 
and valid, but also one which would provide the most information 
about the process a child uses as he attacks the tasks given to him. 
In addition to the high degree of correlation with the MRT, other 
salient features of the Sprigle are that it can be administered in 
10 to 15 minutes by a nonprofessional and that diagnostic
information pertaining to how each task is undertaken can be obtained
by its individual administration. 

It is, therefore, recommended that school districts investigate 
the use of the Sprigle as a possible quick screening instrument for 
kindergarten children which can yield information to assist teach- 
ers and administrators in assessing proper placement for entering 
school age children. 


A MULTITRAIT-MULTIMETHOD VALIDATION OF
MEASURES OF STUDENT ATTITUDES TOWARD
SCHOOL, TOWARD LEARNING, AND TOWARD
TECHNOLOGY IN SIXTH GRADE CHILDREN


SOL M. ROSHAL, IRENE FRIEZE, AND JANET T. WOOD
Institute for Development of Educational Activities, Inc.


Educators and parents are becoming increasingly aware of the
importance of student attitudes. The child is often expected not 
only to learn the required subject matter, but also to enjoy school 
and to look forward to learning new things. Also, there is concern 
among industrial leaders as well as educators that children appre- 
ciate the benefits of technology and that they not be afraid of the 
many machines in their environments. Previous studies (Coleman, 
Campbell, Hobson, McPartland, Mood, Weinfeld, and York, 1968) 
have shown that student attitudes are related to school achieve- 
ment. In their massive studies of United States schools, Coleman, 
et al. found that attitudes towards school and learning were 
significant indicators of verbal skills in sixth graders. Other studies 
have shown positive, but often nonsignificant relationships be- 
tween grades and achievement scores and attitudes (Jackson and 
Lahaderne, 1967; Brodie, 1964).

Even with the increasing interest in measures for assessing stu- 
dent attitudes, there are few existing instruments in the literature. 
Many of those which do exist are parts of larger instruments.
Independent measures (Coleman, et al., 1968)
measure only attitudes towards school (Flanders, 1965; Jackson
and Getzels, 1959) and often do not report validity data. Thus,
there is a need for reliable and validated instruments to measure
various important school attitudes. The present study involved the
validation of three student self-report attitude measures: Attitude
Toward School (ATS), Attitude Toward Learning (ATL), and
Attitude Toward Technology (ATT).

Validity assessment of attitude scales is difficult in general
because absolute criteria for knowing who has a positive or negative
attitude are not readily attainable. Usually, the only data available
to validate an instrument are correlations with other measures of
equally low reliability and/or validity. The most promising
approach to this difficult problem of validity appraisal of attitude
scales is the multitrait-multimethod matrix proposed by
Campbell and Fiske (1959). Its use requires that several traits be
assessed with several independent methods. In the Campbell and
Fiske terminology the "traits" of this study were the three types
of attitudes; the three “methods” used to assess the attitudes were
the newly developed instruments, teacher ratings, and peer ratings.

Method—Source and selection of Items. Large numbers of items
were constructed on the basis of content validity (items believed
by educational specialists to measure the respective attitude) for
each of the three scales. Attitude Toward School (ATS) items
assessed feelings about school as an institution, about teachers and
other school personnel, classmates, school subjects, and the
classroom. Items for Attitude Toward Learning (ATL) were concerned
with the student’s general interest in the world, curiosity, interest
in school subjects, reading, hobbies, and other learning activities.
Attitude Toward Technology (ATT) had items from three
general areas: personal control and understanding of machines, man's
ability to control technology, and the positive benefits of
technology. After several preliminary item analyses studies, which
utilized factor analyses and item-total correlations, the final versions
were constructed. They consisted of 25 items for ATS and ATL and
24 for ATT. Both positively and negatively worded items were
used to control for response bias. Items were answered on 5-point
Likert scales. Sample items from the final version of each scale are
given in Table 1.¹

TABLE 1
Sample Items from the Final Version of Each Scale

ATTITUDE TOWARD SCHOOL
  I (a. always / b. usually / c. sometimes / d. rarely / e. never) hate school.
  Teachers in this school are (a. always / b. usually / c. sometimes /
    d. rarely / e. never) friendly.

ATTITUDE TOWARD LEARNING
  School subjects are (a. always / b. usually / c. sometimes / d. rarely /
    e. never) boring.
  Whenever I go on a trip, I learn (a. [...] / b. many / c. some /
    d. a few / e. no) new things.

ATTITUDE TOWARD TECHNOLOGY
  I (a. strongly agree / b. agree / c. partly agree, partly disagree /
    d. disagree / e. strongly disagree) that most new inventions [...]
    people live better.
  I could (a. always / b. usually / c. sometimes / d. rarely / e. never)
    learn how to fix almost anything.

¹ Inquiries regarding the instruments and the development procedures may
be sent to |I|D|E|A|, Research Division, 1100 Glendon Avenue, Suite


Administration procedures. The three scales, ATS, ATL, and
ATT, were administered along with other questionnaires and
peer ratings to a sample of 610 sixth grade students in 13 [...]
schools. Their average Lorge Thorndike verbal [...]
standard deviation of 15.7. There were [...]
numbers of boys and girls. The sample ranged from lower middle
to lower upper class in socioeconomic status (as judged by school
district personnel).

For peer ratings students were asked to list two or more names,
including themselves if they desired, for each of the following
questions:

Which kids in this class really seem to enjoy school а lot? 




Which kids are really interested in learning about a lot of 
different things? 

Which kids in this class really enjoy using machines or fixing 
things that break down? 

The peer ratings were then computed by counting the number of
times each student was chosen, dividing by the number of students
in the class (to equate the number of possible choices for different
sized classrooms) and multiplying by 25 (to make the numbers
whole values representing approximately the number of times
chosen). Zero scores (i.e., no choices) were eliminated.
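
The peer rating computation just described can be sketched as follows. The function name and the example figures are hypothetical; the text does not say whether the resulting values were rounded, so no rounding is applied here.

```python
from typing import Optional


def peer_rating(times_chosen: int, class_size: int) -> Optional[float]:
    """Peer rating scaled to a notional class of 25 students.

    Returns None for students who were never chosen, since zero
    scores were eliminated from the analysis.
    """
    if times_chosen == 0:
        return None
    return times_chosen / class_size * 25


# Hypothetical example: the same 6 nominations in classes of 30 and 20.
print(peer_rating(6, 30))
print(peer_rating(6, 20))
print(peer_rating(0, 30))  # None
```

Scaling by class size means that six nominations in a class of 20 count for more than six nominations in a class of 30.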

Teachers were also asked to rate each of their students with 
respect to: 

1. Attitude toward school 

Does he like school or not? Does he look forward to school 
and enjoy being there? 

2. Interest in learning both in and out of school 

Does he enjoy learning new things? Is he generally curious? 

3. Interest and understanding of machines 

Does he feel comfortable with various machines? Does he 

understand that machines cannot purposefully harm him? 
These ratings were done on 5-point scales from “very high” to
“very low.” 

Results—Descriptive data. Mean scores for the three attitude
scales were 3.2 for ATS, 3.4 for ATL and 3.2 for ATT. The
standard deviations were .67, .48, and .38, respectively. Since the
possible scores ranged from one for very unfavorable to five for very
favorable, the overall averages were all slightly more positive than
the neutral point. There was most variance in ATS scores and least
in ATT.

The scales had relatively high reliabilities as measured by Alpha 
coefficients (a form of K-R 20 or split-half reliabilities for multi- 
point responses) as shown in Table 2. ATS was most consistent 
with an Alpha of .93. ATL had a value of .84; and ATT, only .68. 

Factor analyses (with orthogonal rotations) indicated that the 
ATS items had high loadings on one main factor—an outcome 
indicating the possible unidimensionality of this attitude. Three 
major factors emerged for the ATT corresponding to the three 
dimensions originally used for content validity. ATL was a more 
complex test that yielded four major orthogonal factors. These 
were tentatively labeled as:

1. Boredom or lack of interest in life. 

2. Enjoyment of learning new things. 

3. Interest in school learning. 

4. Interest in reading. 

Multitrait-multimethod analysis. Analysis of the trait and method
intercorrelations (shown in Table 2) by the Campbell and Fiske
criteria indicated that all three instruments met most of the
criteria.

The criteria and the relevant indices were:

1. The reliability of each trait measure should be significantly
greater than zero and it should be greater than the correlation of
the trait with the other trait measures. Reliabilities of the three
scales (ATS, ATL and ATT) were well over the .09 needed for
significance and were also greater than other scale intercorrelations
(Matrix 1).

2. Diagonal values (italics) in validity matrices (Matrices 2,
3 and 5) should be greater than other values in the same row
or column within that matrix. Thus, in Matrix 2, the validity
correlation of teacher ratings of student school attitudes with ATS
is .30. This is greater than the other values in the row (.27 and
.01) and column (.18 and .00). Including the row and column
comparisons for all three matrices, there are a total of 6
comparisons for each scale.

a. ATS: of the 6 tests, 5 met this criterion

b. ATL: 3 of 6 met criterion

c. ATT: 4 of 6 met criterion

3. Correlations of traits between different methods should be
higher than correlations of different traits within the same method.
This criterion was not met since teacher ratings, peer ratings, and
self-reports of the three traits tended to correlate more with each
other within a given matrix than with each other across methods.

4. The same pattern of interrelationships should be found within
all methods. This criterion was met for all instruments. ATS and
ATL tended to correlate with each other and not with ATT.
Correlations of ATT with ATL were higher than of ATT with
ATS.
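
The row and column comparison in criterion 2 can be sketched with the Matrix 2 values quoted for ATS. The function name is hypothetical, and only the figures given in the text (.30 on the diagonal, .27 and .01 in the row, .18 and .00 in the column) are used.

```python
def validity_diagonal_checks(diagonal: float,
                             row_others: list[float],
                             column_others: list[float]) -> tuple[bool, bool]:
    """Check whether a validity-diagonal value exceeds the other values
    in its row and in its column of one heteromethod matrix."""
    row_ok = all(diagonal > value for value in row_others)
    column_ok = all(diagonal > value for value in column_others)
    return row_ok, column_ok


# ATS in Matrix 2: teacher ratings of school attitude with ATS.
row_ok, column_ok = validity_diagonal_checks(0.30, [0.27, 0.01], [0.18, 0.00])
print(row_ok, column_ok)  # True True
```

Repeating the row check and the column check in each of the three validity matrices gives the six comparisons per scale counted above.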

Discussion. All three scales yielded promising validation
indications as measured by the multitrait-multimethod criteria. In
fact, a review by Campbell and Fiske (1959) indicated that few
instruments in the literature at that time met more than one or two
of these criteria while all three scales here met many of them.
The reliabilities of the three instruments were also as high as or
higher than similar instruments.

ATS, which measures the student's general attitude towards 
school as an institution, might be used by educators to measure 
feelings about school. It is probably relatively sensitive to attitude 
changes (although further studies of this are needed). As shown in 
Table 2, the correlation of ATS and ATL was .68 as might be
expected because of the inclusion of school learning items on ATL.
However, the multitrait-multimethod correlations do give support
for the independence of the two instruments even though both
teachers and peers had some difficulty differentiating the two
concepts. ATL, which indicates a more general orientation toward
learning, probably does reflect more of a personality trait than
does ATS and thus would not be so susceptible to short term 
changes as are attitudes toward school. 

ATT, which measures the student’s perception of his ability to 
fix and run machines, his belief in man’s general ability to control 
machines, and his beliefs about the positive benefits of technology, 
correlates very low with ATS and ATL. ATT might be of interest 
to educators with curricula related to the use of technology, or 
With projects which have machines (such as teaching machines) 
às part of the program. Also, realistic concepts 7 machines аге im- 
portant for daily living in a complex environmen 

Any of the scales may be administered independently or in
combination for elementary school assessment. They are presently
being used with children in third through fifth grades as well as
with children in the sixth grade. Although the reading difficulty of
words used on the scales was purposefully kept low, use with
average readers below the fifth grade is not recommended unless
the items are read aloud. Normative data for sixth grade pupils
are available for all three scales.
DOGMATISM AND CONSERVATISM: AN EMPIRICAL
FOLLOW-UP OF ROKEACH'S FINDINGS 


FRANK COSTIN 
University of Illinois at Urbana-Champaign 


In partial support of his claim that the Dogmatism Scale measures
“general authoritarianism,” and is “relatively free of political
content,” Rokeach (1960, pp. 121-122) has cited low correlations
between scores on his scale and scores on the Politico-Economic
Conservatism Scale (Adorno, Frenkel-Brunswick, Levinson, and
Sanford, 1950). However, since these correlations were consistently
positive, Rokeach also concluded: “. . . The chances are somewhat
better than even that a closed-minded person will be conservative
rather than liberal in his politics” (p. 122).

Unfortunately, the measure of “conservatism” which Rokeach
used consisted of only five items. In describing the construction of
this instrument, Daniel Levinson pointed out that for practical
reasons it became necessary to reduce it considerably from its
original length, and readily conceded that the five items were “not
enough to obtain an adequate measure of reliability, and hardly
enough to be called a ‘scale’” (Adorno, et al., 1950, p. 168).
Furthermore, the items tended to be broadly stated: e.g., “In general,
full economic security is bad; most men wouldn’t work if they
didn’t need the money for eating and living.” “America may not
be perfect, but the American Way has brought us about as close as
human beings can get to a perfect society.” (Adorno, et al., 1950,
p. 169).

What relationship between dogmatism and conservatism might
be obtained if (a) a more reliable measure of “conservatism” were
used, (b) the items dealt with more specific and current issues,
and (c) “conservatism” were more operationally defined? The
national election campaign of 1970 provided an excellent opportunity
to answer this question.
Method. A senatorial candidate from the Midwest set forth his
position as a “conservative” in a newspaper advertisement
inviting readers to take a “test” to see how “conservative” or
“liberal” they were. The “test” consisted of a series of 15 paired
statements; in each pair one of the statements represented a
“conservative” opinion, and the other a “liberal” opinion. For example:

I am for stronger laws to curb the sale and distribution of
pornographic materials.
We should not restrict the freedom of publishers or
producers to produce or sell any material they choose.

Law enforcement officials are too hard on protesters.
I am in favor of taking a stronger stand against protesters
who engage in violence on the campus and in our communities.

In addition to the issues reflected in the above examples, other
paired statements dealt with President Nixon, Vice-President
Agnew, Vietnam, bussing to achieve racial balance in schools, Judge
Hoffman and the Chicago 7, and the Federal welfare programs.
Readers were asked to score their answers according to a key
indicating which statements were “conservative” positions and which
were “liberal” ones; thus, according to the advertisement, readers
could compare their opinions with those of the candidate. (He
supported all of the “conservative” statements.)
Ten pairs of statements were selected from the advertisement
to represent a minimum of redundancy and a maximum of
specificity. Each pair included a “conservative” opinion and a
“liberal” opinion on the same issue. The 20 statements were combined
with the 40 items of the Dogmatism Scale, Form E (Rokeach,
1960, pp. 73-80), and all arranged in a random order to form a
60-item questionnaire labeled “Opinion Survey: Social-Psychological.”¹
(All items on the Rokeach Scale are stated in the “dogmatic”
direction.)
¹ A copy of this instrument may be obtained by writing to: Frank Costin,
731 Psychology Building, University of Illinois, Champaign, Illinois 6

During the summer of 1970 the “Survey” was administered to
78 students (40 men, 38 women), selected randomly from the
subject pool of an introductory psychology course. (Eighty-five
students had been asked to participate, but seven failed to
report.) Directions for responding to the items followed the standard
instructions of the Dogmatism Scale (Rokeach, 1960, pp. 72-73);
however, instead of scoring responses on a scale from +3 (“agree
very much”) to -3 (“disagree very much”), a scale from 6 to 1
was used, with 6 corresponding to +3 and 1 corresponding to -3.
All dogmatism items and the 10 “conservative” statements were
scored accordingly. Direction of scoring was reversed for the 10
“liberal” statements, so that “disagree very much” was assigned
a score of 6, and “agree very much” a score of 1.
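The rescoring rule just described can be written out as a short sketch. The endpoints (+3 scored 6, -3 scored 1) and the reversal for “liberal” statements come from the text; the mapping of the intermediate responses, the absence of a zero category, and the function names are assumptions made for illustration.

```python
def to_six_point(raw: int) -> int:
    """Map a raw agreement rating (+3 ... -3, no zero) onto 6 ... 1.

    Assumed mapping: +3->6, +2->5, +1->4, -1->3, -2->2, -3->1.
    Only the two endpoints are stated in the article.
    """
    if raw not in (-3, -2, -1, 1, 2, 3):
        raise ValueError("raw rating must be one of -3..-1 or +1..+3")
    return raw + 3 if raw > 0 else raw + 4


def score_item(raw: int, reverse: bool = False) -> int:
    """Score one item; reverse=True is used for the 'liberal' statements."""
    score = to_six_point(raw)
    return 7 - score if reverse else score
```

Under this sketch a high score always points in the “dogmatic” or “conservative” direction, whichever way the statement itself was worded.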

For the purpose of computing reliability coefficients (KR-20),
responses were also scored as 1 or 0; in the case of a dogmatism
item or a “conservative” statement, any degree of agreement was
assigned a score of 1; for a “liberal” statement, any degree of
disagreement was assigned a score of 1. Based on the responses of
all 78 students, KR-20 was .78 for the Dogmatism Scale and
.79 for the remaining 20 items. (Correlations between total scores
obtained under dichotomy procedures and those obtained under the
six-point system were .87 for the Dogmatism Scale and .90 for the
other 20 items.)
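As a minimal sketch of the reliability computation named above, the function below applies the usual Kuder-Richardson formula 20 to a matrix of 0/1 item scores. The article does not say whether the total-score variance was computed with N or N - 1 in the denominator; N is assumed here, and the helper names and the toy data are hypothetical.

```python
def dichotomize(six_point_score: int) -> int:
    """1 if the (already direction-corrected) six-point score is 4-6, else 0."""
    return 1 if six_point_score >= 4 else 0


def kr20(responses):
    """KR-20 for a list of respondents, each a list of 0/1 item scores.

    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
    with the variance taken over respondents (divisor N, an assumption).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion scoring 1
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Toy data: four respondents, three items (not data from the study).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(toy), 2))
```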

Results. Table 1 shows the correlations (r's) between scores on
the Dogmatism Scale and the scores on the 20 items measuring
“conservatism”; it also reports the means of these scores. For all
78 students, the correlation between “dogmatism” and
“conservatism” was .56. This coefficient is significantly greater than any
of those which Rokeach obtained when he correlated college
students’ scores on the Dogmatism Scale with their scores on the
five-item Politico-Economic Conservatism Scale. The r's he reported,
and the significance level of each difference, were as follows: .11
(New York Colleges, N = 207, p < .01); .13 (Michigan State
University, N = 202, p < .01); .20 (Michigan State University,
N = 153, p < .01); .28 (Michigan State University, N = 186,
p < .05) (Rokeach, 1960, p. 122).


                              TABLE 1
          Relationship between Dogmatism and Conservatism

                    Dogmatism*        Conservatism†        Dogmatism vs
                    Mean     SD       Mean     SD          Conservatism (r)

Men (N = 40)        131.4    28.2     51.2     13.5          .63
Women (N = 38)      129.0    17.9     47.1     [illegible]   .45
Total (N = 78)      130.3    23.6     49.2     12.8          .56

Note. All respondents were students in an introductory psychology course.
* Dogmatism was measured with the Rokeach Scale, Form E. Each of the 40
items was scored [...]; the higher the score, the greater [...]

Table 1 also shows that the correlation between “dogmatism”
and “conservatism” was higher for men than for women (.63 vs
.45); however, the p value for the difference between these two
r’s was greater than .05.
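The article does not name the test used to compare correlations. One common choice for two independent samples is Fisher's r-to-z transformation, sketched below under that assumption; the function name is hypothetical and the p value is two-tailed.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent r's.

    Returns (z, two_tailed_p). Assumes both samples are independent
    and reasonably large; this is an illustration, not necessarily
    the procedure the author used.
    """
    z1 = math.atanh(r1)  # Fisher transformation of each correlation
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal probability
    return z, p


# Men (r = .63, N = 40) vs women (r = .45, N = 38), as reported in Table 1.
z_sex, p_sex = compare_independent_correlations(0.63, 40, 0.45, 38)

# Present total sample (r = .56, N = 78) vs Rokeach's largest r (.28, N = 186).
z_rok, p_rok = compare_independent_correlations(0.56, 78, 0.28, 186)

print(round(z_sex, 2), p_sex > 0.05)
print(round(z_rok, 2), p_rok < 0.05)
```

With these inputs the sex difference falls short of the .05 level and the comparison with Rokeach's .28 falls between .05 and .01, which agrees with the levels reported in the text.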

These results indicate that the relationship between
“conservatism” (political-economic-social) and Rokeach’s interpretation of
“closed-mindedness” may be stronger than he realized. The findings
may also reflect the advantage of specifying operationally the
“conservatism” one intends to measure. Of course, more extensive
investigations of this kind need to be carried out to see whether
such conclusions can be further supported, by using either the
“conservative” and “liberal” statements employed in the present
study (the issues are still lively ones) or some similar measure.
SPECIFIC ANXIETY THEORY AND THE
MANDLER-SARASON TEST ANXIETY 
QUESTIONNAIRE 


FRANK B. W. HARPER 
University of Western Ontario, Canada 


The specificity theory of anxiety advanced by Mandler, Sarason
and their colleagues in a series of publications since the early
nineteen fifties (particularly Mandler and Sarason, 1952; Sarason,
Mandler and Craighill, 1952; Sarason, Davidson, Lighthall, Waite,
and Ruebush, 1960) hypothesizes that anxiety is attached to and
aroused by specific situations and that it is more valid to measure
anxiety by items pertinent to particular situations than by items
which purport to measure anxiety in some general way. In line
with this theory the Test Anxiety Questionnaire (TAQ) was
developed to measure anxiety aroused by evaluative or testing
situations (Mandler and Sarason, 1952).

The TAQ has undergone a number of modifications, both of
content and scoring, since its inception. The most commonly
used version is one which contains three sections: the first dealing
with anxiety about group intelligence tests, the second dealing with
anxiety about individual intelligence tests, and the third dealing
with anxiety about course examinations. If the specificity theory
were to be followed through to its logical conclusion, one would
expect that the three sections would each be scored separately
and that then these scores would be compared with the appropriate
criterion variable; i.e., the score on anxiety about group intelligence
tests would be compared with the results of group intelligence
tests, and the score on anxiety about course examinations would be
compared with examination results. Historically, however, this
seems not to have been the scoring system used, and instead a total
score on all three sections has been employed as the [...]
variable. The use of such a total composite score can be criticized
on the same grounds that are used to distinguish specific anxiety,
e.g., test anxiety, from general anxiety. If anxiety is really specific
to situations, then one would expect that the most appropriate
items for predicting the effects of anxiety on marks in achievement
examinations would be items which deal specifically with responses
to taking course examinations, and a similar argument could be
advanced for anxiety about intelligence tests. Theoretically, a score
obtained by adding together the scores on different kinds of
anxieties, even if they are all classed generally as test anxieties,
should not be so efficient in predicting a given criterion as one
which is specific to the criterion situation.
Purpose. The present study was concerned with comparing the
concurrent validity of each of the three sections of the TAQ, plus
its composite total, against the criterion of cumulative grade point
average (CGPA) in college academic courses. The use of CGPA
was intended to capitalize on a broader sampling of the course
examination anxiety domain, as compared with, say, a single
semester grade on a particular course, in which there could be
effects of chance and biased sampling.

The following hypothesis was tested: Of the four scores possible
on the TAQ, taking each section by itself, and the total score of
[...]

[...] Western Ontario, Canada, all of whom had just been graduated
with a Bachelor’s degree.

Procedure. Each student was given the TAQ to complete at a
regular class meeting, with instructions which advised him that
the questionnaire was part of a research project on test anxiety.
The college transcript from each student was obtained, and the
cumulative grade point average calculated.

Scoring. There are a number of scoring systems for the TAQ. In
this study each item was divided into five sections, and the student
marked which of the five he felt applied to him. The score for each
section was the sum of the ratings for the items in that section. The
total score was the arithmetic sum of the three sections.

The product moment correlations between the sections and the
cumulative grade point averages are shown in Table 1, together with
the multiple correlation coefficient (R) using each subsection in a
three variable predictor regression equation.

In all four groups the highest correlations occurred as
hypothesized: Anxiety about Course Examinations correlated most
highly negatively with Grade Point Average. All these correlations
between CGPA and Course Examination Anxiety were significantly
different from zero, a fact which was not true of the Total Score
correlations. Only one of the latter type of correlation reached
significance, that of the Minnesota male sample. The correlations
for the sections on Anxiety about Group and Individual Testing
were with one exception not significant. Calculation of the multiple
correlation coefficient (R) predicting CGPA from the three
subsections treated as independent variables showed that R was
numerically greater than the correlation for total score in each instance.
Of course the contribution of Course Examination Anxiety to the
multiple R was very high.
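The finding that R exceeded the total-score correlation in every group is what one would expect on algebraic grounds: a unit-weighted total is one particular linear combination of the sections, and R is the largest correlation any linear combination can attain in the sample. The sketch below shows this from a correlation matrix. All numerical values are hypothetical (the section intercorrelations are not reported in the article), the sections are treated as standardized, and the function names are illustrative.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictor correlation matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def multiple_r(pred_corr, crit_corr):
    """Multiple R from predictor intercorrelations and validities."""
    betas = solve(pred_corr, crit_corr)  # standardized regression weights
    return math.sqrt(sum(b * r for b, r in zip(betas, crit_corr)))


def composite_r(pred_corr, crit_corr, weights):
    """Correlation of a weighted sum of standardized predictors with the criterion."""
    num = sum(w * r for w, r in zip(weights, crit_corr))
    var = sum(wi * wj * pred_corr[i][j]
              for i, wi in enumerate(weights)
              for j, wj in enumerate(weights))
    return num / math.sqrt(var)


# Hypothetical values: three TAQ sections and their correlations with CGPA.
sections = [[1.0, 0.5, 0.4],
            [0.5, 1.0, 0.3],
            [0.4, 0.3, 1.0]]
validities = [-0.12, -0.07, -0.29]

r_multiple = multiple_r(sections, validities)
r_total = composite_r(sections, validities, [1.0, 1.0, 1.0])
print(round(r_multiple, 3), round(r_total, 3))
```

Because R is fitted to the sample, some of its advantage over the total score would be expected to shrink on cross-validation.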

TABLE 1
Product Moment Correlations of Cumulative Grade Point Average with
Test Anxiety Questionnaire (TAQ) Total and Subscale Scores

                                         Males              Females
TAQ Section                          Minn.    Can.      Minn.    Can.

Anxiety about Group IQ Tests         -.12     -.12      -.13     [illegible]
Anxiety about Individual IQ Tests    [illegible]
Anxiety about Course Examinations    -.29*    [illegible]
Total Score                          [illegible]
Multiple R(a)                        [illegible]

(a) Based on treating the three sections of the TAQ as independent
variables.
* p < .05.
** p < .01.
*** p < .005.

Discussion. The hypothesis that Anxiety about Course Examinations
would correlate most highly, negatively, with cumulative
grade point average was sustained by the results. The specificity
theory of test anxiety was given further credibility by the findings.
It appeared that even within the test anxiety domain, there were
different kinds of test anxieties and that a test for one kind did not
necessarily correlate with a test for another kind. An important
consequence of the results is that researchers who have been
content to use only the Total Score of the Test Anxiety Questionnaire
as their measure of test anxiety might possibly have reached
erroneous conclusions about the validity of the scale. If the criterion
measure, e.g., intelligence test performance, is correlated with the
appropriate section rather than with the total score, significant
results might be obtained. Alternatively a multiple correlation using
the three subsections as independent variables in a regression
equation would be preferable to using the total score. Sassenrath's
pessimistic comments (Sassenrath, 1967) on the validity of the
Test Anxiety Questionnaire might be re-examined in the light of
these suggestions.
HOSTILITY AND LEARNING: A FOLLOW-UP NOTE


FRANK COSTIN 
University of Illinois at Urbana-Champaign 


A recent study (Costin, 1970) found significant negative
correlations between examination scores of 50 male students in an
introductory psychology course at the University of Illinois, and their
precourse hostility scores, as measured by the Scrambled Sentence
Test (Costin, 1969). Without necessarily implying a direct cause
and effect relationship, the investigator interpreted the role of
hostility in this context as an “interference” with learning. Additional
data gathered independently of this study were consistent with such
an inference: end-of-semester grade point averages of male students
enrolled in the Special Educational Opportunity Program at the
University (N = 129) were found to be negatively correlated with
presemester scores on the Scrambled Sentence Test.

The purpose of the present study was to discover whether the
Scrambled Sentence Test might also be a negative predictor of
achievement in a very different kind of educational setting: a
highly technical course at a military installation. If so, the
inference that hostility may interfere with learning would gain greater
plausibility.

Method. The subjects were 60 enlisted men at an Air Force
Technical Training Center. They were enrolled in a 16-week course
dealing with principles of meteorology and techniques of
observing and recording weather phenomena. Teaching-learning
activities included lectures, demonstrations, discussion, and
assigned readings. Course achievement was evaluated
with objective examinations and practical performance tests.

The Scrambled Sentence Test was administered to all men at
the beginning of the course. Prior to entering the course they had
completed the Air Force Qualification Test, a group test of
general mental ability.
The Scrambled Sentence Test was the only measure the
investigator was permitted to introduce into the classroom procedures;
furthermore, no pretest of knowledge or skills concerning weather
observation was being used by the instructional staff at this time.
Thus, whatever advantages or disadvantages such precourse
information might reflect could not be assessed. However, the investigator
did not consider the lack of such information as crucial for
demonstrating additional evidence concerning the relationship between
hostility and course achievement, since one might reasonably assume
that individual differences in knowledge of weather observation
principles and techniques were probably minimal at the beginning
of the course. It was also assumed that failure to control for such
precourse knowledge and skill need not necessarily preclude
interpreting end-of-course achievement as “learning,” even though
conventional definitions of learning usually incorporate the
concept of “change.” As Bereiter (1963) has observed, “many of the
situations in which people change are not situations in which
change is a meaningful variable. Situations involving uniform
training procedures aimed at bringing subjects to a certain terminal
level of performance are of this type . . . [pp. 13-14].” (Bereiter
used an electronics course to illustrate the point.)
TABLE 1

Intercorrelations of Male Air Force Students’ Scores on Hostility Test,
Scores on the Air Force Qualification Test (AFQT),
and Total Course Achievement (N = 50)

                  Zero-order r              Partial r with achievement,
              Achievement    Ability        ability held constant

Hostility       -.41*        [illegible]    [illegible]
AFQT             .61*

Note. Hostility was measured with the Scrambled Sentence Test, Form C
(Costin, 1969); the higher the score, the greater the hostility. (Maximum
possible score was 30; mean and SD illegible.) Achievement was based on
total number of points accumulated on written and performance tests.
(Maximum possible score was 300; mean = 251.1, SD = 12.0.)
* p < .01.

Results. Table 1 shows the intercorrelations of scores for hostility,
course achievement, and mental ability. As the table indicates,
the zero-order correlation between hostility and achievement [...]
was -.39 (p < .01). This finding is consistent with that obtained
for the 50 male students in the previous study of introductory
psychology; in that instance, with ability held constant, the partial
correlations between Scrambled Sentence Test scores and
achievement scores on two objective examinations of principles and other
empirical generalizations were -.41 and -.45 (p < .01). (When
both ability and precourse knowledge were controlled, the r's were
-.40 and -.44 respectively.)
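Holding ability constant, as in the partial correlations discussed above, amounts to the standard first-order partial correlation formula. The sketch below applies it to the two zero-order values that are legible in Table 1 (-.41 and .61); the hostility-ability correlation is not legible in the table, so the value -.18 used here is an assumption chosen only for illustration, as is the function name.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y with z held constant (first-order partial r)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_host_ach = -0.41   # hostility with achievement (Table 1)
r_afqt_ach = 0.61    # AFQT with achievement (Table 1)
r_host_afqt = -0.18  # hostility with AFQT: assumed, entry not legible

print(round(partial_correlation(r_host_ach, r_host_afqt, r_afqt_ach), 3))
```

When hostility and ability are only weakly related, as assumed here, partialling out ability changes the hostility-achievement correlation very little.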

The fact that the Scrambled Sentence Test was a negative
predictor of course achievement in these two different teaching-learning
situations lends further support to the possibility that hostility
tends to interfere with learning. For the present this inference
should be restricted to men, since the correlations between
achievement and hostility scores for women in the psychology course
(N = 51), while negative, were not significant (p > .05). Although the
data of the psychology study suggest that hostility may be less of
an interference for women than for men, such a conclusion must be
held in abeyance, since tests of significance for differences between
the r's yielded probability values greater than .05. Investigations
are now under way to discover whether negative relationships
between hostility and learning may indeed be greater for men than
for women.
PREDICTING THE BEHAVIOR OF INSTITUTIONALIZED
DELINQUENTS WITH—AND WITHOUT—CATTELL’S 
HSPQ¹


VERNON О. TYLER, JR. 
Western Washington State College 


ROBERT F. KELLY 
Spokane, Washington 


The High School Personality Questionnaire (HSPQ) is one of
a family of tests developed by Cattell and his co-workers (Cattell,
Beloff, and Coan, 1958) to measure 14 personality factors of
children ages 12-17.

There has been considerable research on the factorial validity of
these tests (e.g., Cattell, 1957; Cattell and Scheier, 1961) and efforts
have been made to match their personality factors to rating data
(e.g., Becker, 1960; and Schaie, 1962). However, Vernon (1965)
stressed the need for validation against a variety of external
criteria. There have been a few normative studies on delinquent
populations (Pierson and Kelly, 1963a, b; Stern and Grosz,
1969). The HSPQ has improved the prediction of academic
success (e.g., Butcher and Gorsuch, 1960) and selected delinquents from
the general population.² However, this test has seldom predicted
behavior within a delinquent population. Pierson (1964) and
Pierson, Cattell, and Pierce (1966) showed the HSPQ and other Cattell
tests are sensitive to personality and academic changes in
institutionalized delinquents. Tyler and Kelly (1962) found that
HSPQ's administered in a diagnostic center for court-committed
delinquent boys predicted ratings of behavior of these youths in
treatment institutions several months later with
specification-equation “multiple R’s” (Cattell, Beloff, and Coan, 1958) ranging
from .18 to .55.

¹ Data were collected when the authors were at the Fort Worden
Diagnostic and Treatment Center, Port Townsend, Washington. Fort Worden is
an institution of the Washington State Department of Institutions, Division
of Juvenile Rehabilitation. Many thanks are due Gus Lindquist,
Superintendent, and Assistant Superintendents Robert Tropp and Robert Koschnick
for their support and encouragement of this research. Appreciation is also
extended to the cottage staff who gave their time to complete the diagnostic
ratings. Portions of this paper were read at Western Psychological
Association, San Diego, March, 1968.

² H. Scheier, personal communication, January 11, 1962.

The present study investigated the efficiency of the HSPQ and 
diagnostic ratings in predicting inmate characteristics. 

Procedure. Subjects were 168 male offenders ages 14-18 housed
in a state diagnostic center. Forms A and B of the HSPQ (1958
edition) were administered to these boys and they were rated by
cottage staff who knew them for several weeks on 16 dimensions
covering such behaviors as table manners, group functioning, work
habits, hostility to staff, and masculinity (see Table 1). Because of
staff and inmate turnover and scheduling, the raters were not
trained, and the ratings were not made on all boys on one scale at
a time, nor by the same number of raters; consequently, the
reliability of these ratings probably was attenuated to an unknown
degree. Several months later at four forestry camps, the boys’ camp
counselors³ rated them on these 16 dimensions and nine others, 25
criterion dimensions in all (see Table 1). These 25 rating scales were
originally developed around characteristics the camp counselors
considered most often in describing inmate behavior. Efforts were
made to word the scales in the counselors’ own language. At the
time of administration, the counselors knew the boys rated quite
well. Their only instructions were to attempt to use all categories on
the scale in a pattern at least resembling a normal distribution.
Reliabilities of tests (using raw scores) and ratings were calculated⁴
(Ebel’s intraclass correlation; Guilford, 1954) and multiple
regression equations set up⁵ for prediction of scores on each of the 25
camp rating scales using the Wherry-Doolittle test selection method
with the Wherry shrinkage formula (Garrett, 1958).


TABLE 1
Diagnostic Center and Forestry Camp Rating Scales

Scale No.

 1. Excellent table manners.  123456789  Disgusting table manners.
 2. Functions well in a group (3 or more persons).  123456789  Does not function well in any group.
 3. Not accident prone.  123456789  Very accident prone.
 4. Always tells the truth.*  123456789  A regular liar.
 5. Feels guilty when does something wrong.  123456789  Does not feel guilty when does something wrong.
 6. Always does a good day's work.  123456789  Never does a good day's work.
 7. Not hostile to staff.  123456789  Extremely hostile to staff.
 8. Does not usually foul himself up.  123456789  Seems determined to mess himself up.
 9. Pretty open and aboveboard.  123456789  Can't trust him out of my sight; very sneaky, always up to something.
10. Satisfactory adjustment on work crew.*  123456789  Unsatisfactory adjustment on work crew.
11. Never have to be firm with him to keep him in line.  123456789  Got to be really tough with him to keep him in line.
12. An adequate placement (or foster home) exists for him when paroled.  123456789  No adequate parole placement exists at this time.
13. A very likeable kid.*  123456789  A very unlikeable kid.
14. Calm.  123456789  Very nervous.
15. Good parole bet; will make it.  123456789  Very poor parole risk; won't make it.
16. Very manly; can stand on his own two feet.  123456789  A real “mama's boy.”
17. Attains long term goals (several months ahead).  123456789  Can't even attain a short term goal (one or two days).
18. A nice looking kid.*  123456789  Ugly kid.
19. Trusts staff.  123456789  Does not trust staff at all.
20. Never hits or pushes.  123456789  Often hits and pushes.
21. Talks freely in counselling sessions.*  123456789  “Clams up” in counselling sessions.
22. Seldom picked on by other boys.  123456789  Frequently picked on by other boys.
23. Thinking seems O.K.  123456789  Thinking seems pretty crazy.
24. Satisfactory adjustment in camp (not on work crew).  123456789  Unsatisfactory adjustment in camp (not on work crew).
25. Sexually normal.  123456789  Abnormal sexual behavior.

* Indicates scales used in forestry camps only.

Results. The equivalent forms reliability coefficients for the HSPQ
with the Spearman-Brown correction for length⁶ ranged from .33
to .68 for the 14 test factors.
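For readers who wish to see what the Spearman-Brown correction does to a reliability coefficient, a minimal sketch follows. It assumes that the reported values refer to Forms A and B combined (a doubling of length), which the article implies but does not state outright; the function names are illustrative.

```python
def spearman_brown(r_single: float, factor: float = 2.0) -> float:
    """Reliability of a test lengthened by `factor`, given the reliability
    (here, the Form A with Form B correlation) at the original length."""
    return factor * r_single / (1.0 + (factor - 1.0) * r_single)


def single_form_r(r_corrected: float, factor: float = 2.0) -> float:
    """Invert the correction: the between-form r implied by a corrected value."""
    return r_corrected / (factor - (factor - 1.0) * r_corrected)


# The reported corrected range of .33 to .68 would correspond to
# between-form correlations of roughly .20 to .52 under this assumption.
print(round(single_form_r(0.33), 2), round(single_form_r(0.68), 2))
```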

On the 16 diagnostic center rating scales, some boys were rated
by as few as three raters and others by as many as eight. Problems
with these rather untidy data have prevented the computation of
reliability coefficients; it is estimated that they would range from
.50 to .80. Reliability of the mean criterion ratings ranged from .50
to .92 with a median of .86.

Multiple R’s corrected for shrinkage were calculated for predicting
camp ratings with the HSPQ, with diagnostic center ratings and
with a combination of HSPQ and diagnostic center ratings. The
HSPQ alone predicted with R's ranging from .17 to .41 with a
median of .30, with all but one R significantly greater than zero
(p < .01). Diagnostic center ratings estimated camp ratings with
R's ranging from .29 to .55 (median R = .49) with all R’s
significant (p < .01). Combined HSPQ and diagnostic center
ratings produced R’s from .34 to .61 (median R = .53) with all R's
significant (p < .01).

The data show a clear-cut trend: while the 25 mean camp ratings
were predicted by the HSPQ alone with fairly sizeable R's (median
R = .30), the validities of the diagnostic center ratings were
considerably higher (median R = .49); and the combination of HSPQ
and diagnostic center ratings was the highest of all (median R =
.53).

Discussion. The obtained equivalence reliability coefficients for
[...]

With one exception, all 75 R's computed were highly significant.
[...] interesting, however, was the clear predictive
superiority of the diagnostic center ratings over the HSPQ. It
appears that for the practical task of predicting treatment
institution behavior, diagnostic rating shows more promise than
diagnostic testing. The slight gain in predictive power produced by
combining tests and ratings hardly seems worth the cost of testing.

⁵ Acknowledgement and appreciation [...] of the
Mathematics Department [...] Rowley [...]
the Computer Center, Western Washington
State College, for calculating the multiple regression equations.

⁶ Complete data for the study may be obtained from the senior author.

At best though, the predictive efficiency of any of these procedures
is low. With R’s in the low .50’s, only about 25 per cent of the
variance in the camp ratings is accounted for. However, with further
work, the predictive power of the diagnostic center ratings could
probably be improved to the point of genuine diagnostic usefulness.
With inmates remaining in diagnostic cottages for short stays of
six weeks or less, the problems of rating are difficult. However, with
trained raters and the rating of all boys in the cottage on one scale
at a time (as was done in the forestry camps), improved predictions
should result.

Of course, it must be noted that since for each R, a small number
of predictor variables (2-8) was selected from a large pool of
variables (14-30), the danger of shrinkage was present.
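The point about shrinkage can be made concrete with a small sketch. The function below uses the widely taught adjusted form of the Wherry correction, 1 - (1 - R²)(N - 1)/(N - k - 1); the exact variant given in Garrett (1958) may differ slightly, and the assumption that all 168 boys contributed to each equation is ours, not the authors'. The comparison shows why a correction based only on the predictors retained understates the shrinkage to be expected when those predictors were picked from a larger pool.

```python
import math


def shrunken_r(r: float, n: int, k: int) -> float:
    """Multiple R corrected for shrinkage (adjusted form; an assumption).

    r: obtained multiple R, n: number of subjects,
    k: number of predictors the correction is based on.
    """
    adjusted_r2 = 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(adjusted_r2) if adjusted_r2 > 0 else 0.0


obtained_r = 0.53   # median R for HSPQ plus ratings, taken as uncorrected
n_subjects = 168    # assumed to apply to every equation

# Correction based on the 8 predictors retained vs the pool of 30 searched.
print(round(shrunken_r(obtained_r, n_subjects, 8), 2))
print(round(shrunken_r(obtained_r, n_subjects, 30), 2))
```

Treating the whole pool as the number of predictors is a conservative rule of thumb rather than the authors' procedure; cross-validation remains the direct check.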

Cross validation is needed, but even more important would be a 
replication of the study with improved diagnostic ratings. 
INTEREST PROFILES OF CLERGYMEN AS INDICATED
BY THE VOCATIONAL PREFERENCE INVENTORY


DAVID L. SCHULDT
Wesley Foundation, Iowa City


ROBERT F. STAHMANN
The University of Iowa


The literature related to the interests and personality patterns
of clergymen does not include a report of the use of the Vocational
Preference Inventory (VPI) with pastors (Schuldt, 1970). John
Holland, the developer of the VPI, has reported an interest pattern
for clergymen on his instrument (Holland, 1969). The purpose of
the present study was to validate Holland's reported pattern for
clergymen on the VPI with a sample of active United Methodist
pastors.

Instrument and sample, Holland has proposed that the choice of 
a vocation is an expression of personality. The VPI consists of eleven 
Seales which measure vocational interests and aspects of personal- 
ity. The first six scales (Realistic, Intellectual, Social, Conventional, 
Enterprising and Artistic) measure specific interests and relate 
them to occupational environments. The remaining five scales (Self- 
Control, Masculinity, Status, Infrequency and Aequiescence) yield 
Information about other aspects of the subject’s personality (Hol- 
land, 1965), 

The clergymen used in the study were randomly selected from 
among United Methodist clergymen serving Iowa churches as pastors. 
For the purpose of the study only those men who had served fewer
than 15 years in the pastorate since their ordination were selected.
The mean age of the respondents was 39.4 years. Seventy-three
percent (N = 55) of the pastors sampled (N = 75) returned the
questionnaire materials, which were administered by mail. A control
sample of 105 employed adults, most of whom were college
graduates, was drawn from the VPI Manual (Holland, 1965).

Results. Directions for VPI interpretation indicate that the 
highest score represents a dominant personality type and the four
highest scores form a personality and interest pattern. Holland has 
reported a hierarchical pattern for ministers on the VPI of SAIE 
(Social, Artistic, Intellectual, Enterprising) (Holland, 1969). The 
VPI interest pattern for the clergymen sample tested in this study 
was also SAIE, which supported Holland’s prediction. Table 1 
summarizes the data for two groups: pastors and employed adult
males. As can be observed from Table 1, the pattern for pastors is 
SAIE while that for employed adults is EICS (Enterprising, 
Intellectual, Conventional, Social). 

The pastors’ highest score on the VPI was on the Social scale; 
thus, they were of the Social personality type. The VPI Manual 
interprets high scores for males on this scale to indicate persons who 
are responsible, accepting of feminine impulses and roles, and are
“facile and insightful in interpersonal relationships.” Such
persons have the ability to form “close” as opposed to “superficial”
relationships. They have been described as valuing social and
religious achievement.


TABLE 1
Differences between Vocational Preference Inventory Scores for Samples of United
Methodist Pastors and Employed Adults

                        Pastors                   Employed Adults*
                        SAIE                      EICS
                        (N = 55)                  (N = 105)
Scale                   Mean         SD           Mean         SD           t-test
 1. Realistic           2.91         [illegible]  [illegible]  [illegible]  [illegible]
 2. Intellectual        [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
 3. Social              8.13         4.00         [illegible]  4.0          [illegible]
 4. Conventional        2.25         3.15         4.4          3.5          3.81**
 5. Enterprising        3.60         3.36         [illegible]  [illegible]  8.47**
 6. Artistic            [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
 7. Self-Control        10.38        3.85         9.3          3.2          1.89
 8. Masculinity         6.58         2.09         [illegible]  [illegible]  12.46**
 9. Status              8.07         2.96         9.8          2.0          [illegible]
10. Infrequency         6.04         3.97         4.0          2.5          5.69**
11. Acquiescence        10.22        5.15         13.4         5.4          3.50**

* Data from the VPI Manual, p. 34.
** p < .001.
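
The t values in Table 1 can be checked approximately from the tabled means and standard deviations. The sketch below assumes an independent-samples t test with a pooled variance estimate; the source does not say which form of the test was used, and the function name is illustrative. It is applied to the Self-Control row, reading the pastors' standard deviation as 3.85.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from summary statistics,
    using a pooled variance estimate (equal variances assumed)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se


# Self-Control row of Table 1: pastors (N = 55) vs. employed adults (N = 105).
t = pooled_t(10.38, 3.85, 55, 9.3, 3.2, 105)
print(round(t, 2))
```

Under these assumptions the result agrees with the tabled value of 1.89 for Self-Control.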

The profile of the pastors’ group with the Artistic scale second 
indicated that pastors have artistic, musical, and literary interests 
and also value having a philosophy of life. The rank order of the 
two remaining scores obtained by the pastor sample supports the
description of the Social personality type for this group. 

It was not surprising that pastors scored lowest on the Realistic 
and Conventional scales. Unlike pastors, the Realistic type person
tends to be mechanically oriented with low social interests and has 
an aversion for problems requiring a sensitivity for feelings. The 
Conventional type person achieves his goals through subordinate 
roles and by conforming and ordering his life according to pre-
scribed ways of behavior. The scores obtained by the clergymen
sample on the remaining five VPI scales (Self-Control, Masculinity,
Status, Infrequency, Acquiescence) further substantiated the SAIE
pattern.

Summary. Pastors may be described as most sensitive to personal, 
humanitarian, social, and emotional influences. They are least
sensitive to materialistic influences and roles which require struc-
tured, conforming behavior. Such a description is, of course, a
generalization, but it is interesting to note its similarity both to
persons in the pastoral ministry and to the descriptions of desired
characteristics of persons entering the ministry (Department of 
Ministry, 1969). 

Most of the work of the average pastor is in direct relationship 
to persons and their needs (Social), but it also includes creative 
and innovative leadership (Artistic), study and teaching (Intellec- 
tual), and administration (Enterprising). Hence, the picture of 
clergymen (SAIE) generated from Holland’s (1966) theory and the
VPI appears to have some merit.
This book is designed to serve as a source book of information and
procedures for counselors, teachers, and other persons concerned with 
testing and as a textbook for students in these areas. This reviewer 
feels that this book does adequately meet some of these objectives, 
but is inadequate in meeting others.

The book is basically organized in three general sections. The
first section, Chapters 1 to 3, deals primarily with background
and methodology in testing; the second section, Chapters 4 to 7,
is primarily concerned with measurement in the cognitive field;
and the third section, Chapters 8 and 9, emphasizes affective
measurement.

Chapter 1 of the book devotes itself to background and resource
information for testing. The first section of the chapter gives a very 
brief, but sufficiently adequate overview of the history of measure- 
ment. It is felt by this reviewer that the section on “Sources of 
information” is too brief. The best that a textbook in this gen- 
eral field can hope to achieve is an overview of standardized tests 
that are available with encouragement to the readers to seek more 
detailed information in specific references devoted to standardized 
tests. A table of contents of the Sixth Mental Measurements Year-
book, quoted on page 7, does give the student knowledge of the vari-
ous fields for which standardized tests are available. 

According to the preface “no previous exposure to statistics is 
assumed.” Yet, it appears that this book in its section on statistical 
methods in testing tries to go too far too fast. For example, by the 
sixth page of the discussion on statistical methodology the summa- 
tion operation is described including the use of sigma complete with 
superscripts and subscripts. Four pages later the reader is asked to 
calculate a correlation coefficient based on summation operations
and is given a formula with seven summation signs. A passing 
glance at that formula would probably discourage most of the 
readers for whom this book is designed and appropriate. 

Chapter 2 concerns itself with the preparing, administering, scor-
ing, and evaluating of tests and test items. This chapter is a very
thoughtful and concise discussion of the problems and techniques in
test preparation. It appears to be an excellent background for un-
derstanding the problems of test preparation, but does not appear
to be adequate for instructing the classroom teacher in the prepara-
tion of classroom tests.

Chapter 3 concerns itself primarily with the matter of the char-
acteristics of satisfactory measuring instruments such as the ques-
tions of reliability, validity, standardization and norming. This
material appears to be very adequately covered and at a level of
sophistication appropriate for the group for which this book is de-
signed. One particular aspect that was appealing to this reviewer
was a brief, but very adequate, discussion of expectancy tables and
a nomograph for predicting grade point averages. This type of ap-
plied material is especially appropriate for the audience to whom
this book is directed.

The next two chapters are primarily an overview and directory 
of standardized achievement tests and individual and group tests 
of intelligence. Here the author has made what appears to be an 
excellent and very up-to-date selection of the more widely used 
tests in these areas with a brief annotation and evaluation of these 
tests. The material presented here gives a satisfactory introduction 
to the use and purposes of tests of this nature and the annotation 
is appropriate for the non-professional test user who wants to find 
some information about the more common tests used in the schools. 

Chapters 7, 8 and 9 are similar to the two chapters mentioned 
above except that they cover the tests of special abilities, measures 
of interest, attitudes, and personality. As in the two previously 
mentioned chapters these three chapters serve as an excellent basic
quick reference to up-to-date evaluation in these areas. 

Research and theories on general intelligence is the subject matter 
of Chapter 6. Again, it appears that this is a superior overview 
and review of this area and at a level of sophistication appropriate 
for the individuals for whom the book is designed. It does appear 
that sufficient distinction between the ratio IQ and the deviation 
IQ is missing throughout all discussion of intelligence and intelli-
gence measurement in the book. The final chapter on current issues
is an excellent summary of the modern-day problems in testing and a
very realistic forward look to what could or might be happening in
the area of testing.
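
The distinction the reviewer finds missing can be stated in a few lines. In the sketch below the function names and example values are hypothetical, and the default standard deviation of 15 is an assumption (some scales have used 16). The ratio IQ depends on mental age relative to chronological age, whereas the deviation IQ is a standard score within the examinee's own age group.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float,
                 iq_sd: float = 15.0) -> float:
    """Deviation IQ: a standard score within the examinee's age group,
    rescaled to mean 100 and standard deviation iq_sd."""
    z = (raw_score - norm_mean) / norm_sd
    return 100.0 + iq_sd * z


# Hypothetical examinee: mental age 10, chronological age 8.
print(ratio_iq(10, 8))
# Hypothetical raw score one standard deviation above the age-group mean.
print(deviation_iq(60, 50, 10))
```

Because the second function is tied to the spread of scores at each age, equal deviation IQs mean the same relative standing at every age, which is not guaranteed for ratio IQs.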

Overall the greatest strength in this book lies in well written
up-to-date annotation and summary of the more common types of
devices currently in use in the schools. This does pro-
vide a valuable quick reference for the nonprofessional test user
in assisting such persons in understanding the results obtained from
testing programs as well as assisting in their communication with
professional test users.

A useful supplement to the text is a Study Guide containing
chapter summaries, self-testing exercises, lists of terms, and names
of tests. An Instructor’s Manual composed of parallel readings,
audio-visual materials, and test items for each chapter is also
available.

The reviewer can recommend this book as a textbook or a
source book for background information or training of the non-
professional test user such as the classroom teacher and the school
administrator who need to have understandings in this area. 


ROBERT M. COLYER 
Duke University 


Benjamin S. Bloom, J. Thomas Hastings and George F. Madaus.
Handbook on Formative and Summative Evaluation of Student 
Learning. New York: McGraw-Hill, 1971. Pp. iii + 923. $11.95. 


The scope and depth of treatment given to educational evaluation 
in this volume account for the delay in its publication. Quite wisely
and courageously the authors relate evaluation to developments in
curriculum and instruction and to the whole changing social setting 
in which evaluation is to be applied. 

The organization of the book is reminiscent of the 1945 NSSE 
Yearbook, on The Measurement of Understanding. Substantial 
introductory sections of general application (280 pages) are followed 
by a doubly long section of eleven separately authored chapters
devoted to statements of objectives and their evaluation in pre- 
school education, language arts, secondary school social studies, art, 
science, secondary school mathematics, literature, writing, second 
language learning, and industrial education. The area of preschool 
education is treated in two chapters, one devoted to evaluation of 
socio-emotional, perceptual-motor, and cognitive development, and 
the other covering early language development. 

Although the authors properly point out that this is a handbook, 
hence not to be read cover to cover, the reader may well ponder 
the opening chapter giving their “View of Education.” They forth- 
rightly declare a belief in the fundamental teachability of all chil- 
dren, arguing the considerable modifiability of social-class-linked 
ing factors of standard language development, motivation to 
secure maximum education, willingness to work for teacher approval 
and/or long-term goals, and acceptance of school learning tasks 
"with a minimum of rebellion." With evaluation turned from 
selection to development, a modern version of the “plan-test-teach- 
test-plan” model gives a place for not only fuller specification of 
instructional objectives in terms of behavioral outcomes, but in-
sertion of formative evaluation and feedback as intervening steps
in the “test-teach-test” sequence. Their argument for the primacy
of structure of the learning process over structure of a subject seems
moot in that there may be more need among their readers for those
concerned with process to be attentive to subject structure than vice
versa in the present cycle on that issue. 

Learning for mastery is given the full treatment it deserves. This 
reviewer counts himself among those strongly influenced and helped 
by this approach in his instruction. Life is a work-limit rather than 
a time-limit situation largely, and giving students as long as they 
need and are willing to spend to master the content of most courses 
is productive in student learning. This is true regardless of whether
the learning units have been or can be broken down all the way 
into blocks or elements for individual mastery. Cooperative effort 
by teacher and student to achieve fixed goals is an integral part of 
mastery learning, fitting it conceptually into the “View of Educa- 
tion” already presented.

The second major section on “Using Evaluation for Instructional 
Decisions” might be subtitled “The Humane Use of Tests.” The 
need for summative, certifying or grading evaluation is not blinked, 
but it is put in a framework of teacher as helper to mastery by 
subsequent chapters on “Evaluation for Placement and Diagnosis” 
and “Formative Evaluation.” These latter chapters expand to full 
treatment their introduction earlier in the discussion of formulating 
objectives. One wonders whether the present shift from summative 
and predictive evaluation may not some day make it more natural 
to present these topics in their proper chronological sequence of
diagnosis, formative feedback, and summative evaluation.

A third major section (105 pages) relates evaluation procedures 
to the several categories of the Taxonomy of Educational Objectives, 
Handbook 1—Cognitive Domain and Handbook 2—Affective Do- 
main, by Bloom, Krathwohl, et al. It begins on a note of whole-
some respect for knowledge outcomes, noting how these outcomes
still are basic in the best curriculum outlines and guides. The logic
of the several levels of cognitive outcomes: knowledge, compre-
hension, application, analysis, synthesis and evaluation is clearly
spelled out and well illustrated with items from college and acade-
mic high school levels. Readers will generally be concerned with use
of this framework in their subject areas, so may be directed to the
separate chapters of Part 2 for illustrations more directly useful in
preparing evaluation devices for their courses. One could wish for
a broader set of references at many points. Creativity is discussed
under “Synthesis” without mention or listing references to Guilford
or Torrance. A warning against narrow interpretation of “Evalua-
tion” as including mere preference is given without reference to
the definitive research of Kropp and Stoker. And one misses any
reference to the problem of presenting test exercises to young, slow,
or foreign students less adept in the nuances of language taken for
granted in the illustrations.


Chapter 10 on “Evaluation Techniques for Affective Objectives”
presents a strong case for including evaluation of the achievement 
of such objectives. In response to the oft-voiced fear of brainwash- 
ing or invasion of privacy through grading of affective behavior of 
children, the authors point out that no summative grades need be 
given (or recorded) individually. Rather, formative evaluation may 
be used to feed back constructive guidance to individuals, while 
anonymous group data may be used formatively and summatively 
to evaluate the curriculum and instruction. The very real problem 
of enlisting individuals in their own improvement and the problem 
of the “socially desirable” answer remain, but the “intangibles”
need to become thus much more tangible, rather than left as vaguely 
hopeful long-term outcomes. 

Short, efficient chapters are devoted to developing the technology 
of evaluation systems and emerging developments in evaluation. In 
the first of these, total evaluation systems are conceived as basic to 
helping the teacher help students learn by organizing a supporting 
technology and specialists to give leadership and guidance in its 
use. The second chapter seems unduly eclectic, quoting extensively 
from National Assessment, but omitting the systematic work of 
Guba and Stufflebeam and the whole concept of accountability. 

The subject chapters deserve separate review by separately com- 
petent specialists. Suffice it to say here, that this reviewer found the 
two chapters on preschool education especially rich with detailed 
illustration and definite comment. Kamii’s chapter has the special 
merit of being based on an experimental program in which a Piaget- 
ian curriculum is struggling to be born. Cazden's chapter happily 
concentrates on new insights of psycholinguistics as applied to the
fundamental problem of compensatory education; her comments on 
the use of specific standardized tests are particularly cogent.

It is also significant that “Evaluation of Learning in a Second 
Language" is presented by one who has taught English as a second 
language and sees foreign language instruction and its evaluation 
in that context. Some of these insights, combined with the observa-
tions on preschool language development, might illuminate the more
traditional breakdown into language arts, literature and writing in 
the typical English curriculum. 

Unfortunate omissions are chapters presumably originally planned
in elementary social studies and mathematics. The chapter on
secondary school social studies suffers most because of the still
prevalent tendency to equate social studies at that level with history 
and government. Elementary social studies is a ferment of disci- 
plines, including rediscovery of geography. The chapter on secondary 
school mathematics will interest many because of the adaptation of
the Bloom taxonomy to fit mathematics objectives. The original
taxonomy has had the virtue of showing that one reason mathematics
has been so hard to many students is the failure until recently
to reward anything below problem-solving as achievement worthy 
of noting. 

At one point in writing this review, your reviewer was prepared to 
organize the AAHSB (Association against Heavy Slippery Books). 
Not only is such a book difficult to read in the bathtub, but its weight 
discourages its use as a textbook. However, as he contemplated 
the cost of printing a main volume with paperback chapters on the 
specific subjects, it became obvious that a single-volume edition
was essential if costs were to be kept in bounds as they have been. So, 
here's to a truly elegant volume, in the best sense of that adjec- 
tive. It is a reference that should stand long despite the shifting 
sands of curriculum and society. 


WARREN G. FINDLEY
University of Georgia 


William W. Cooley and Paul R. Lohnes, Multivariate Data Analysis, 
New York: John Wiley & Sons, 1971. Pp. x + 364. $9.95. 


Multivariate Data Analysis is an appealing title for a book. Espe- 
cially for those persons who are familiar with John W. Tukey’s 
essays on data analysis, this title is apt to conjure up visions of a
revolutionary text on multivariate methods. If contents of a book 
were to adhere closely to Professor Tukey’s thinking they would, 
in relation to contents of other extant books, be more strongly and 
More persistently oriented towards the scientist's working ques- 
tions—such as: Given my data, just how have I added to my 
knowledge of the phenomena under study? How might my data be 
more productively gathered and analyzed at successive stages of re- 
search? How might my motivating questions be most effectively 
sharpened for the collection and analysis of further data? Etc. 
Tukey has argued for several years that there is a place for a
science called data analysis, where subject-matter questions are
held to be preeminent, where formal methods and models for in-
ference are subordinated to informal ones, where mathematical
tially conventional. The text consists of three Parts which have
been further divided into 13 chapters. Part I (two chapters) in-
cludes a general overview and a very brief introduction to vectors
and matrices. Part II (five chapters) is entitled “Studies of a Single 
Population," with chapters covering partial, multiple and canonical 
correlation as well as principal component analysis. Part III (six 
chapters) is entitled “Multiple Population Studies"; multivariate 
analysis of variance, discriminant function analysis (and classifica- 
tion procedures), and general strategy considerations are offered. 
The entire approach is intended as introductory; the reader is 
assumed to be an applied researcher who is at least familiar with 
classical univariate procedures. There are 60 pages of FORTRAN
program listings as well as a brief FORTRAN primer “to whet
your appetite and give you courage” (p. 26). For persons who have 
seen the authors’ 1962 volume, Multivariate Procedures for the 
Behavioral Sciences, the present book will appear essentially as a 
revised and expanded version. 

On the positive side, the authors have clearly improved on their 
1962 effort. This version is physically attractive in both format and 
style of print and suffers from few misprints (although some minor 
slips were made in bold print expressions on pp. 59, 141 and 177 
and a table is misaligned on p. 134). Several of the expository and 
matrix-based discussions of methods are reasonable, as far as they 
go, and the associated FORTRAN programs rely on many of the 
newer developments in numerical analysis. Taken together, the chap- 
ters of this book provide several examples and substantial com- 
mentary as to how multivariate methods may be used in behavioral 
and social science research; perhaps for many people this is enough 
to merit a recommendation. 

Unfortunately, a number of points must be made on the negative
side. Some of the problems seem to require little commentary. Ex- 
amine the following statements in relation to one another: 

“Most of our procedures are concerned with estimating or making 
inferences about parameters of a m[ultivariate] n[ormal] d[istri-
bution]” (p. 1).

". . . for the most part the examples are from surveys. The 
emphasis on heuristic rather than hypothesis testing uses of the
[illegible] follows from this preoccupation with survey sciencing”
(p. 7).

“By a model we mean the basic matrix algebra specification of a 
procedure for analyzing data” (p. 10).

“This book is not concerned very much with distribution theory, 
nevertheless it is worthwhile to take a look at some of the properties 
of a multivariate normal distribution . D5. 

Despite the disclaimers, dozens of t-, F- and Λ-statistics are to be
found, and some of the chapters contain several pages on tests of
significance. Strange. 

Other anomalies: 

“We do not undertake . . . to fit higher order polynomials because 
it seems to us that even a finding of significance for a cubic term 
would so discourage the behavioral scientist that he would want to 
change his method of scaling one of his constructs, or change his 
research design, or his line of work" (p. 79). 

“A living human being is a highly integrated system, all the overt 
characteristics and behaviors of which are interrelated. If we want 
an uncorrelated vector variable, we have to construct it by trans-
forming the data” (authors’ italics) (p. 97).

Further problems are to be found in the chapters on component 
analysis. At several points the authors imply that statistical in- 
ferential tests are available for testing hypotheses on the basis of 
characteristic roots and vectors of correlation matrices. While it is 
true that certain tests are available for (Thurstonian) common 
factor methods (using maximum likelihood statistics) it has not 
been possible to develop formal hypothesis testing methods for com-
ponents methods. Other confusions in these chapters have to do with
improper use of the term “communality” (see especially, p. 150); 
strictly speaking, this term has no relevance to component analysis. 
Also no distinction is made in the discussion of “rotation” methods, 
between primary and reference axis systems. This results in con- 
fusions between pattern and structure matrices. 

The book can be faulted for a total lack of sustained methodo- 
logical themes. A major example has already been noted—that the 
authors see themselves teaching Tukeyian data analysis, but that 
they continually fall back on classical significance tests and as- 
sociated baggage; and there is no mention of graphical plotting 
methods or general procedures for improving the fit of models to 
data. Tukey calls “fitting” the “workhorse of data analysis.” An- 
other problem in this context is that while all the methods are 
special cases of the general (multivariate) linear model, it is al-
most impossible from this presentation to see most of the basic
interrelationships. Besides missing several opportunities to synthe- 
size methods, it is rare to find discussions showing how a single 
method may have different uses, depending on the investigator's
questions. It does not help to find nearly all examples from the
survey files of PROJECT TALENT. 

While this book may meet certain limited purposes, it should be 
clear that any recommendation must be substantially qualified. 
The title of the book seems particularly misleading. Forthcoming 
books by Tatsuoka (Multivariate Analysis for Educational Re-
search; John Wiley and Sons), Bock (Multivariate Statistical
Methods in Behavioral Research; McGraw-Hill) and a recently
published book by Rummel (Applied Factor Analysis; Northwest-
ern University Press) may be worthy alternatives to this one.


ROBERT M. PRUZEK
State University of New York 
at Albany 


Allen L. Edwards. Probability and Statistics. New York: Holt,
Rinehart and Winston, 1971. Pp. xvi + 257. $8.50.


This moderate-sized text for introductory courses in applied 
probability and statistics is meant for students with adequate 
grounding in algebra but none in calculus. Its basic emphasis “is
on proving equations and theorems that the student is ordinarily 
asked to take on faith . . .” Proofs are almost exclusively in terms 
of discrete variables in order to avoid the necessity for calculus. In 
addition to use as a text, Edwards recommends the book for sup- 
plementary reading in applied courses. It is obvious that this is the 
real purpose for which the book was meant, since the mode of 
presentation is relatively formal, the amount of exposition and 
linking material is generally minimal, and attention to direct ap-
plications is practically nil.

In practice, this book could be used to supplement an intuitively- 
oriented text or lectures emphasizing intuitive ideas and applica- 
tions, by providing formal proofs of assertions. For such use it has 
several advantages. Foremost of these is the clarity and consistency 
which characterize Edwards’ writing style. Assumptions are always 
clearly and explicitly stated, proofs are developed in an orderly 
step-by-step fashion without mysterious jumps, and terminological 
and notational conventions are defined and then adhered to through- 
out. 

Of comparable importance are the range of topics and the care 
given to their selection. The book begins with the algebra of samples 
and proceeds to the ideas of sample spaces and probabilities de- 
fined upon them. Counting procedures follow and then discrete 
random variables and their expected values. After this Edwards 
develops the properties of random samples drawn from finite popula- 
tions without replacement; binomial variables and ranked vari- 
ables are then treated, followed by the Poisson, normal, chi-square, 
Student's t, and F distributions. The three final chapters deal with
expected mean squares in one-way anova, power, and confidence 
intervals, respectively. Almost all topics which might be contained
in the usual introductory course are covered. In addition, the 
chapters following the one on sampling without replacement from 
finite populations are relatively self-contained and could be used 
more-or-less independently. 
Three aspects of this coverage seem especially good to me. First,
the distinction between population and sample standard deviations 
is made and maintained without fuss and without confusion. Second, 
the treatment of sums of simple random variables is thoroughly 
done, making possible an easy transition to notions of linear com- 
binations in later courses on measurement or factor analysis. Third, 
the Poisson distribution is developed and linked to other distribu- 
tions in an understandable way. 
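
The first of these distinctions is easy to illustrate. The sketch below uses Python's standard `statistics` module on made-up scores: the population standard deviation divides the sum of squared deviations by N, while the sample standard deviation divides by N - 1.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, N = 8

# Population SD: sum of squared deviations divided by N.
sigma = statistics.pstdev(scores)
# Sample SD: sum of squared deviations divided by N - 1.
s = statistics.stdev(scores)

print(sigma, round(s, 3))
```

The sample value is always the larger of the two for the same data, and the gap narrows as N grows.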

Areas of strength notwithstanding, the book has several weak-
nesses. It would be more in line with contemporary usage to de-
scribe experiments in the behavioral sciences as “modeled by” or
“represented as” random experiments rather than saying they “are 
random experiments” (p. 26). Such a statement seems likely to 
blur what is, for introductory students, the already fuzzy line be- 
tween scientific method and mathematical models. 

A second fault, and one shared by most writers of statistical texts, 
is the failure to attribute important features such as the handling 
of particular topics to the men who originated them. Edwards’ 
treatment of probability owes a great deal to Feller; most chapters 
on probability in most texts on methods owe similar debts to Feller 
and the time has come to acknowledge them.

On the whole, this text is well done. However, the purposes for 


simply as a source of supplementary material on the mathematical
underpinnings of applied statistics, it should prove most satisfactory. 


JAMES A. WALSH
Iowa State University 


Robert L. Thorndike. Educational Measurement. (2nd. ed.) Wash- 
ington, D.C.: American Council on Education. Pp. 768. $15.00. 


The first questions to be asked of a volume of this nature are who
is its intended audience and what is its general usefulness. This
volume falls somewhere between an encyclopedia and a mechanic's
manual; it contains compendiums of current facts, step by step
how-to-do-its, and theoretical discussions of varying levels of 


The reviewer faced the same problem as the editor—how to 
evaluate such a broad coverage of the field. The list of authors and 
those who read or contributed to the various chapters is a Who's
Who in test and measurement. This reviewer, in need of consensual 
validation, turned to his colleagues, Clint Chase, Gerald Bracey, 
Richard Pugh and Sydney Mifflin, who read sections of the book 
and discussed them with him. The reviewer, however, takes full
responsibility for what appears here. 

The appropriate audience for this book is a diverse one and 
unhappily, to this reviewer one which does not include teachers and 
administrators. There are excellent chapters for the theoretician 
and the advanced student and some excellent sections for the test 
item writer, the printer, and the clerk in a testing department. 
However, a teacher turning to this reference work would find that 
the chapters which might be useful are overly drawn out, full of 
platitudes and display a surprising lack of sophistication about
children and schooling. For the most part the articles contain psycho-
metric theory by educational psychologists who are apparently
much less versed in knowledge of schools, schooling and children. 

The first chapter of the book begins with an excellent overview 
by Robert Thorndike of the changes that have taken place during 
the past twenty years in the field of test and measurement. He 
places emphasis on the role of the computer and data processing 
in test development. Thorndike’s emphasis in this chapter is one 
of adequate data collection. He holds that test producers can no
longer justify shortcuts now that the rapid analysis features of
computers are available to the test constructor. To this reviewer's
taste, Thorndike’s discussion of the political and social problems
of testing is only a superficial summary. The first section of
this book covers Test Design, Construction, Administration and 
Processing. Chapter Two, Defining and Assessing Educational Ob- 
jectives, is a disappointing chapter. The authors focus clearly on 
the need of test constructors to define pupil behavior and the necessity
of attempting to maximize the probability that the test be a mea- 
sure of student learning. However, limited space was given to dis- 
cussion of the critical issue of values and how the test developer 
selects objectives, i.e., how does the developer attempt to meet 
societal or humanistic needs. Lindquist’s brilliant chapter in the 
1951 Educational Measurement (Preliminary Discussion in Ob- 
jective Test Construction) is quoted but his ideas are not developed 
or applied as fully as one would hope, given the concerns of the
1970's.

Chapter Three, Planning the Objective Test, is a compendium of 
folk lore and unvalidated common sense assumptions. This is a very 
long chapter which could be greatly shortened by more concise 
writing. High points of the chapter are the editor's excellent note 
on the problem of guessing and the author's discussion on practical
issues of reliability. 

Chapter Four, Writing the Test Item, should be useful to the clerk
in a test production department. It contains many examples of 
items and the general suggestions for writing items are excellent. 
However, helpful though the chapter might be, it is unfortunately 
a composite of a multitude of unnecessary platitudes, e.g., “a good
test is composed of well written items (81).” 

A more important criticism of this chapter is the author's sugges- 
tion to include test items to force teachers to attend to areas of 
the curriculum that they are ignoring even though the psycho- 
metric properties of the items may not be fully acceptable. This 
suggestion seems to this reviewer a very questionable procedure, 
both psychometrically and ethically. 

Chapters 5-8 concern Gathering, Analyzing and Using Data on 
Test Items, Reproducing the Test, Test Administration, and Auto-
mation of Test Scoring, Reporting and Analyses. They will
undoubtedly be useful to the test department of a commercial firm or
large school district. 

Chapter 6, Reproducing the Test, should be particularly useful to 
the printer or those who give instructions to the
printer.

Part two of the book covers Special Types of Tests. Chapters
include Performance and Product Evaluation, Essay Examinations, 


tion had read Cronbach, Jones and Davis, included elsewhere in
the volume more levity could be applied to the current state of 


The strongest section is section three, Measurement Theory, 
particularly Lyle Jones’s The Nature of Measurement, Stanley’s 
Reliability, and Cronbach's Test Validation. However, in the same
section, Angoff's otherwise excellent chapter on Scales, Norms, and 
Equivalent Scores is exhaustive beyond the point of relevancy and 
Cooley's Techniques for Considering Multiple Measurements is 


"strong" true score theory, however this is a minor point. 

Cronbach likewise writes clearly about complex issues. His 
presentation of construct validity is outstanding. He discusses his
departure from and agreement with Loevinger's earlier stand. He speaks
to the ultraoperationalist and non-behavioralist as well. He and
Stanley have the ability to write simple statements that help the
reader center on the basic issues of test theory, e.g., “one does not
validate a test, but the interpretation of data arising from a specified
procedure (447).” Cronbach is immensely quotable.

Angoff's chapter is very comprehensive in scope and could be 
used in production endeavors. He includes a section on sampling 
and design for methods of equating test forms which should prove 
to be exceedingly useful to the test producer. Angoff's examples are 
frequently drawn from physical measurements and the analogies are
somewhat debatable. He makes the analogy between a scale of 
weights and a scale of typing ability. Although Angoff’s analogy 
is helpful in parts, applying the analogy to a multi-factored human 
skill is a debatable practice. 

Jones's chapter on The Nature of Measurement covers a breadth 
of topics usually covered in test theory courses. He is readable. 
Jones stresses appropriately that the necessary prerequisite before 
measurement can be made is to define the attribute in quantifiable 
terms that contain meaning. This is an excellent point and he holds 
to it throughout his chapter. However, when he gives a case study 
to clarify the conception and perception of attributes he uses the 
example of the case history of length rather than an example from 
an educational setting. Likewise, the examples in his excellent section
on classification and rankings are drawn from fields other than education
and psychology. This reviewer’s major criticism of this chapter
is of the tendency to use non-educational examples to make a point
or to compare physical measurement scales for the purpose of de-
veloping a frame of reference for the state of the art of educational
measurement. Jones uses non-educational examples frequently in
his otherwise excellent chapter. 

For example, Jones’s use of examples outside of education is enter-
taining in the case history of length, irrelevant in the discussion
of classification and rankings, and distracting in the section on units
of measurement. Jones’s final section contains a discussion of Camp-
bell’s classic theses of the 1920’s and the alternatives behavioral
scientists have produced. Jones, in this section, returns to his major
theme: it is the meaningfulness of the empirical counterpart which
is important, not what you call it.

Section Four, Application of Tests to Educational Problems, is a 
strong section of what has yet to be a well developed field, i.e., 
evaluation. Glaser’s and Nitko’s discussion of the four activities
of instructional design, i.e., analyses of subject matter, diagnoses
of characteristics of the learner, design of instructional environment,
and evaluation of learning outcomes, is a good one and well de-
veloped. Their discussion of instructional models is current and
their discussion of norm- versus criterion-referenced tests should be
useful to program evaluators. Davis's Use of Measurement in Stu-
dent Planning and Guidance is useful but should be read in con-
junction with Hill's chapter, Use of Measurement in Selection and
Prediction, which is largely aimed at college selecting.

The Astin and Panos chapter, The Evaluation of Educational 
Programs, clearly defines evaluation as the collecting of informa- 
tion upon which to base a decision. However, in this reviewer's 
opinion they overview the field superficially. In addition, when
they do draw upon the developmental needs of children they draw
upon Bruner’s 1961 work, Piaget's 1950 work and Erikson's
1950 work, even though Bruner's circa 1961 position was inconsistent
with Piaget’s circa 1950. 

This book has several chapters that will serve as a useful refer-
ence for advanced students in educational psychology, particu-
larly those in advanced measurement courses. Rarely does the vol-
ume contain new information for the professional. For the beginning
test developer it will provide some useful hints and suggestions. Its 
strength is its weakness; it covers so much that users will have to 
be very selective in what they recommend to their students. The 
strong points have been mentioned above; its weakness is that 
there are all too few educational psychologists specializing in
measurement who can bring to bear a knowledge of children and
curriculum. Isn’t it about time we in education demanded that the 
level of scholarship about measurement be matched with an equally 
high level of scholarship about children and schools? 


NICHOLAS J. ANASTASIOW
Institute for Child Study 
Indiana University 


Billy Turney and George Robb. Research in Education: An Intro-
duction. Hinsdale, Ill.: The Dryden Press, 1971. Pp. xi + 320.
$6.95 (paperback).


This book is essentially a “how to do research” text. It was written 
primarily for use in the initial course in research methodology by 
students of education. The authors suggested that the book could
also “... serve as a useful reference for classroom teachers, counsel-
ors, or administrators who are interested in doing research but
need a ‘refresher course’ ” (p. vii). Twelve chapters cover the re-
search process from selection of a researchable problem to writing the
final report.

As an introduction this chapter seems somewhat brief and frag-
mented. However, most of its topics are dealt with at more length
subsequently. 

"Selection and Evaluation of a Problem" is the topic of chapter 
2. The authors display good common sense in this section, and do 
well in treating this important aspect of research which does indeed 
often plague fledgeling researchers. 

In discussing “The Research Proposal” in chapter 3, Turney and 
Robb included the elements that one expects to find in a well- 
written proposal. However, the order of a proposal is presented as 
being quite rigid and nonadaptable. One should be allowed more
freedom to tailor the form of a proposal to fit a specific problem
than the authors seem willing to permit. Throughout this chapter,
relevant examples are used very effectively to illustrate points being
made.

The next two chapters concern the use of the library. Chapter 4 
is an adequate if tedious presentation of what is available in a
library, while chapter 5 handles library use. Again, suggestions
sound more like prescriptions, such as the specification of the exact 
format one should use for recording reference information on 3 x 5 
cards. More emphasis could have been beneficially placed on the 
writing of nonpedestrian literature reviews. 

In chapter 6 the three types of research introduced in chapter 
1 are elucidated. The brevity with which the authors chose to 
write weakens this chapter. Descriptive research is limited pri- 
marily to surveys and case studies, with prediction studies, for 
example, never even mentioned. Experimental research is given a
“once over lightly,” focusing on field studies, field experiments, and
independent and dependent variables. Designs containing other 
than one control group and one experimental group are neglected 
completely. 

The next chapter is entitled “Analysis and Treatment of Data.” 
Approximately 25 statistical topics are presented from frequency 
distributions and percentiles, through confidence intervals and t 
tests, to regression lines and Spearman’s rho. In condensing most 
of the topics found in a complete statistical textbook into 32 pages, 
frequent misinterpretations, omissions, symbols and concepts used 
without definitions, etc., resulted. More importantly, the possibility
of converting a naive student into a reasonably sophisticated user 
of those 25 techniques in that brief a space is highly questionable. 
Given the restraints, emphasis would seem to have been better 
placed on the understanding of data, data analysis, and the use of 
statistics. A list of references could well have been included for 
specific techniques.

The dual themes of chapter 8, “Factors Affecting Research Re-
sults,” are what Campbell and Stanley (1963) have labeled internal
and external validity. Both subjects are handled interestingly and
adequately. In chapter 9, a long and nicely representative list of
paper-and-pencil data gathering devices is offered. Included is a
description of the use and interpretation, as well as the limitations,
of each. The sections on limitations do tend to be somewhat
pessimistic in that techniques which can overcome specific limita-
tions go unmentioned. The most surprising omission in this chapter
is the absence of references to more extensive presentations of the 
various techniques. 

The subject of chapter 10 is “Computational Aids for the Re- 
searcher.” It is divided into two pages on desk calculators and nine 
pages on digital computers. Frankly, the section on computers is 
incredibly poor. About 90 per cent of this section concerns content 
which educational researchers do not need to know, and most 
probably do not really care about; namely, computer hardware. 
Furthermore, even the explanation of computer hardware is muddled. 
For example, core storage is described in terms of ferromagnetic 
rings and electrical charges of changing directions. However, the 
fact that all storage and processing is done in binary is kept secret. 
Bytes and words are mentioned, but bits are not. 

Most universities which support graduate research have a com- 
puter installation with user oriented “canned” programs which 
meet the needs of most researchers, especially beginners. Thus, this 
chapter on computers would seem more sensible if it contained some 
orientation to the workings of a computer facility as they relate to
a user, and to the use of existing programs. This is the aspect of
computers which is of vital importance to the book’s intended
audience. In spite of this, Turney and Robb devoted only two sen-
tences, one to program libraries and one to the workings of a computer
facility.

Chapter 11 contains specific rules on the nitty-gritty of “Report-
ing Educational Research,” such as the proper use of pronouns,
footnotes, and references. Following the text of this chapter are 28
pages of examples which illustrate the preceding principles. Two
aspects of this chapter are curious. First, no direct reference is 


Chapter 12 is the book’s greatest strength. A research proposal 
and a research report are presented, and each is expertly critiqued
by Linda Mitchell Crocker of the University of Florida. The
pedagogical value of this section is moderated because each ex-
ample is presented in toto, followed by its critique. Were the specific
comments of the critiquer to appear contiguous with the aspect
being commented upon, possibly in a double-column presentation,
the result would be much more effective. Also, the research proposal
is criticized for not having a title when, in fact, it has one, indicat-
ing some editorial carelessness.

Overall, Research in Education: An Introduction by Turney and 
Robb can be described as brief, not very sophisticated, and super- 
ficial. Some parts are poorly planned and written. Therefore, use 
of this book as an exclusive text for a course is questionable. 
However, it could serve as a supplement, possibly to provide a stu- 
dent with a fairly quick overview of the research process. To com- 
mend it, the book is generally readable and contains many examples. 
The closing critiques are especially worthwhile. In sum, it is prob- 
ably fair to say that the authors attempted to present too many 
topics in too short a space. In trying to serve two masters, brevity 
and comprehensiveness, they tended to lose sight of their ultimate 
masters, students. 

Recent evaluation mandates by federal legislation, criticisms
from the public sector which suggest that educators have not ac-
counted for the resources they have used up, teacher demands for a
greater voice in school planning and administration, and the ap-
pearance of a large number of instructional alternatives generated
by curriculum projects, commercial developers and management
consultants have all created an unmet need for methods for planning
and conducting educational evaluation studies. With evaluation
being the hot topic that it is today, a text with the title The Evalua-
tion of Instruction must arouse great expectations on the part of
the educational practitioner. Evaluation specialists do not have a
set of guidelines parallel to those provided to the research specialist
by works such as the Campbell-Stanley treatment of experimental
design. It is unfortunate that these expectations are not completely
met by the Wittrock and Wiley book, although some excellent logi-
cal frameworks are provided.

The text is primarily the product of a symposium sponsored by 
the UCLA Research and Development Center for the Study of 
Evaluation held at UCLA on December 13-15, 1967. The sym- 
posium papers and discussions are supplemented by four papers 
on causal models which were originally published elsewhere. The
volume is divided into seven sections.

The first section, an introduction by Wittrock, reflects one theme 
of the text when Wittrock argues that in evaluating instruction one 


are central to evaluation studies. The purpose of any evaluation 
study is to determine the worth of some phenomenon. With that 
purpose in mind the evaluator will identify a set of questions 
about the phenomenon to be answered by the study. Certain infor-
mation needs will follow directly from the questions. If the needed


way feel obliged to estimate cause-and-effect relations. Many ex- 


presentations of the symposium, argues skillfully that two primary
criteria of measurement are distinctiveness (of measurement opera-
tions for any one inferred entity) and freedom from distortion
(noise). The design of distinctive measurement must involve a two-
stage operation. Anderson extended Gagné’s major points to em-
phasize the importance of systematic analysis of test stimuli as
compared to instructional stimuli. The Gagné and Anderson papers 
should be required reading for any student of psychological meas- 
urement. 

The fourth section contains a paper by Dan Lortie, and com- 
ments by C. Wayne Gordon and N. L. Gage. The section is en- 
titled “Contextual Variables.” Lortie presents a concise but con-
vincing argument about the diversity of evaluation roles that are
resulting from innovation and large-scale organizational change. 
The paper is essential reading for all evaluation students. 

The fifth section, entitled “Criterion Variables,” includes papers 
by Samuel Messick and Marvin Alkin. Comments on the Messick 
paper are provided by Paul Blommers and Leonard Cahen and 
comments on the Alkin paper are given by Marvin Hoffenberg and 
John Bormuth. Messick provides a detailed argument for a focus 
on cognitive styles, and a lesser argument for a focus on affective 
reactions, in evaluation studies. It is unfortunate that a more syste-
matic analysis of the problem of specifying and measuring unin-
tended outcomes of educational programs was not attempted, since
the title of Messick’s paper suggested a more comprehensive dis- 
cussion. Alkin suggested a cost-effectiveness model as a tool for 
educational evaluators. The attempted model suggests a healthy 
movement toward investigating the utility of techniques from
disciplines (e.g., economics) other than psychology and applied
mathematics for use in educational evaluation. 

The final section of the symposium, entitled “Methodological 
Issues,” contains papers by David Wiley and Martin Trow. Com- 
ments on the Wiley paper are provided by Chester Harris and 
Theodore Husek and comments on the Trow paper are given by 
Eugene Litwak and David Nasatir. Wiley’s paper, limited by a 
very narrow definition of evaluation, is most valuable for his dis-
cussion of specific analysis techniques available for use by evalua-
tors and researchers. Trow discusses problems in evaluation design
in higher education. 

The last section of the book, an appendix, contains two papers 
written by Herman O. A. Wold, a paper by Otis D. Duncan, and
a paper by A. H. Yee and N. L. Gage. The focus of the papers is
toward teasing causal inferences out of nonexperimental data. This 
appendix was included, no doubt, to address the concerns contained 
in Wittrock’s introductory section of the text. 

With eminent scholars, such as those listed above, participating
in the UCLA symposium, it would be highly improbable that no
major contributions would be recorded in the Wittrock-Wiley vol-
ume. The comments by Stake, Glass, and Scriven and the papers
by Lortie and Alkin should be on all reading lists for evaluation 
courses. The paper by Gagné and the comments by Anderson, 
Postman, and Bormuth should be required reading for students of 
measurement. The paper by Wiley and those papers attached in 
the appendix of the volume are important readings for students of 
research design and analysis, measurement, and evaluation.

The volume fails to meet its promise in that many issues and few 
answers are provided for the practicing evaluator and that the 
promises of the book title and section headings are never fulfilled. 

These shortcomings are undoubtedly a function of the time at which
the symposium was held. In 1967 much less had been written 
about the evaluation process than today. As Chester Harris re-


should be noted, however, that, even today, few answers exist to 
the many issues identified at the symposium. It would be very de-
sirable to continue to conduct symposia periodically of the type
recorded in the Wittrock-Wiley volume with published proceedings 
available within six months after the end of each symposium. New 
members to the group, with fresh ideas, and representing other 
disciplines in addition to psychology and applied mathematics,
could contribute greatly. 

There is no doubt that the volume will serve well as a reference 
in evaluation, statistics and research design, measurement, and 